Audio output control

ABSTRACT

Systems and methods for audio output control are disclosed. Audio may be output via a speaker of a communal device associated with a first portion of an environment. A user may provide a user utterance indicating an intent to add another device in a second portion of the environment to the audio-output session, and/or an intent to move the audio-output session from the first device to the second device, and/or an intent to remove a device from an audio-output session. Based on this determined intent, audio-session queues may be associated and dissociated from devices and device states may be altered to effectuate the intent of the user utterance.

RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 17/713,075, filed on Apr. 4, 2022, which claims priority to U.S. patent application Ser. No. 17/107,156, filed on Nov. 30, 2020, now known as U.S. Pat. No. 11,294,622, which issued on Apr. 5, 2022, which claims priority to and is a continuation of U.S. patent application Ser. No. 16/222,751, filed on Dec. 17, 2018, now known as U.S. Pat. No. 10,853,031, which issued on Dec. 1, 2020, which claims priority to and is a continuation of U.S. patent application Ser. No. 15/889,754, filed on Feb. 6, 2018, now known as U.S. Pat. No. 10,157,042, which issued on Dec. 18, 2018, the entire contents of which are incorporated herein by reference.

BACKGROUND

Environments may have multiple audio output devices, such as speakers. In some instances, those speakers can output the same audio. Described herein are improvements in technology and solutions to technical problems that can be used to, among other things, provide alternative means to control audio output via multiple devices in an environment.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth below with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items. The systems depicted in the accompanying figures are not to scale and components within the figures may be depicted not to scale with each other.

FIG. 1 illustrates a schematic diagram of an example environment for audio output control.

FIG. 2 illustrates a schematic diagram of an example environment for causing an additional device to output audio.

FIG. 3 illustrates a schematic diagram of an example environment for moving output of audio from a first device to a second device.

FIG. 4 illustrates a schematic diagram of an example environment for causing one of multiple devices to cease output of audio.

FIG. 5 illustrates a flow diagram of a process for causing an additional device to output audio.

FIG. 6 illustrates a flow diagram of a process for moving output of audio from a first device to a second device.

FIG. 7 illustrates a flow diagram of a process for causing one or multiple devices to cease output of audio.

FIG. 8 illustrates a schematic diagram of an example environment for selecting one of multiple devices as a hub device.

FIG. 9 illustrates a flow diagram of an example process for audio output control.

FIG. 10 illustrates a flow diagram of another example process for audio output control.

FIG. 11 illustrates a flow diagram of another example process for audio output control.

FIG. 12 illustrates a conceptual diagram of components of a speech processing system for processing audio data provided by one or more devices.

FIG. 13 illustrates a conceptual diagram of components of a speech processing system associating audio output commands with multiple devices.

DETAILED DESCRIPTION

Systems and methods for audio output control are described herein. Take, for example, an environment, such as a home, that has multiple audio-output devices. The audio-output devices may be speakers that may be positioned around the environment at different locations. For example, one device may be positioned in the kitchen, another in the bedroom, and another in the basement. The devices may be associated with each other based at least in part on, for example, the devices being manufactured by the same company, the devices being associated with a user profile and/or user account, the devices operating via the same speech-processing system, and/or the devices being associated with an application residing on and/or accessible by a personal device, such as a mobile phone, tablet, and/or other computing device. In examples, the devices may output differing audio such that the kitchen device outputs a first song, the bedroom device outputs a second song, and/or the basement device outputs a third song. In other examples, a user may desire to have multiple devices output the same audio (e.g., in a time-synchronized fashion, such that the audio data is outputted by multiple devices within milliseconds (less than 10 ms, 20 ms, etc.) of each other). For example, the user may desire the kitchen device and the bedroom device to output the same song at the same time. To achieve this functionality, the user may provide tactile input via the application associated with the personal device, if the application includes such functionality. However, the user may desire to achieve this functionality and/or other functionality such as adding devices to output audio, moving audio to a different device, and/or ceasing output of audio on one or more of the devices via voice commands.

To address these shortcomings, the present disclosure describes example systems and methods for improved audio output control. Continuing with the example provided above, a user may be located in the kitchen and may provide a user utterance to the kitchen device to output audio corresponding to a given artist. One or more microphones of the kitchen device may capture audio corresponding to the user utterance and generate corresponding audio data to be sent to a remote system for processing. The remote system may determine intent data corresponding to the audio data. The intent data may represent an intent to output the desired audio. The remote system may send directive data to the kitchen device, which may cause the kitchen device to output the requested audio. An audio-session queue may be identified, determined, and/or generated that represents a listing of audio data and/or files, such as songs, to be utilized during the audio session. The audio-session queue may be associated with the kitchen device.

The user may then desire to have another device, such as the basement device, output the audio while continuing to output the audio via the kitchen device, or to move the audio to the basement device instead of the kitchen device, or to cease output of audio on one of the devices while continuing to output the audio on one or more of the other devices. To do this, for example, the user may provide a second user utterance to any of the devices in the environment. Audio data corresponding to the user utterance may be generated and sent to the remote system, which may determine the intent to add, move, and/or remove a device from the audio session. In other examples, instead of audio data, input data, such as from input on a mobile device executing an application thereon, may be generated and sent to the remote system.

In the example where the user requests to add a device to the audio session, such as via a user utterance like “Alexa, play the music in the basement too,” the state of the basement device may be associated with the state of the device currently outputting the audio, say the kitchen device. In this way, operations performed via the kitchen device may also be performed via the basement device. By so doing, the kitchen device may act as a hub device and may cause the other associated devices to perform similar operations as the kitchen device. Additionally, the audio-session queue associated with the kitchen device may be associated with both the kitchen device and the basement device such that both the kitchen device and the basement device are provided access to the audio-session queue. In examples, before the kitchen device and the basement device are associated with the audio-session queue, the kitchen device may be dissociated from the audio-session queue. Based at least in part on the audio-session queue being associated with the kitchen device and the basement device, both the kitchen device and the basement device may output the requested audio. Data representing the association of the audio-session queue and/or the shared state of the kitchen device and the basement device may be sent to a mobile device such that the application residing on and/or accessible by the mobile device may present the current status of audio output via the devices.

In another example where the user requests to output audio from another device instead of the device currently outputting the audio, such as via a user utterance like “Alexa, move the music to the basement,” the state of the basement device may be associated with the state of the device currently outputting the audio. The state of the device currently outputting the audio may then be dissociated from the state of the basement device. Additionally, the audio-session queue associated with the first device may be associated with the basement device and dissociated from the first device. Based at least in part on the audio-session queue being associated with the basement device, the basement device may output the audio while output of the audio may cease on the first device. Data representing the association of the audio-session queue and/or the change of state of the basement and/or kitchen devices may be sent to the mobile device such that the application residing on and/or accessible by the mobile device may present the current status of the audio output via the devices.

In another example where the user requests to cease output of audio from one of multiple devices outputting audio, such as via a user utterance like “Alexa, stop music in kitchen,” the state of the kitchen device may be dissociated from the state of the other device outputting the audio. Additionally, the audio-session queue associated with the kitchen device may be dissociated from the kitchen device. Based at least in part on the audio-session queue being dissociated from the kitchen device, output of the audio may cease on the kitchen device. If the kitchen device was the hub device, one of the other devices outputting the audio may be selected as the hub device. Data representing the dissociation of the audio-session queue and/or the change of state of the kitchen device may be sent to the mobile device such that the application residing on and/or accessible by the mobile device may present the current status of the audio output via the devices.

Utilizing user utterances to add devices, move audio to different devices, and/or cease output of audio on certain devices may be performed without creating group identifiers when multiple devices are associated to output audio. For example, an audio-session queue may typically be associated with a device or a device group with a single identifier. Each time devices in the group change, such as by being removed or added, a new group and corresponding group identifier is generated. Generation of a new group for each group change adds latency and leads to challenges when trying to seamlessly start and/or stop audio output on multiple devices. The techniques described herein do not generate a new group for each device grouping, but instead maintain devices separately and associate device states and audio-session queues as described herein, leading to decreased latency between user utterance and performance of a corresponding action, and allowing audio to be started and stopped on multiple devices via user utterances received from some or all of the devices.
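The association model described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the class name `AudioSession`, its fields, and the hub-promotion policy (promote the longest-associated remaining device) are assumptions made only for illustration. The sketch shows a single audio-session queue whose set of associated devices changes on add, move, and remove requests without a new group or group identifier being generated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AudioSession:
    """One audio-session queue plus the devices currently associated with it.

    All names here are hypothetical; the disclosure does not define this type.
    """
    queue: list[str]                                   # e.g. song titles
    devices: list[str] = field(default_factory=list)   # kept in association order
    hub: Optional[str] = None

    def add(self, device: str) -> None:
        # Associate the queue with one more device; the first device acts as hub.
        if device not in self.devices:
            self.devices.append(device)
        if self.hub is None:
            self.hub = device

    def remove(self, device: str) -> None:
        # Dissociate the queue from a device; reselect the hub if it left.
        if device in self.devices:
            self.devices.remove(device)
        if self.hub == device:
            # Assumed policy: promote the longest-associated remaining device.
            self.hub = self.devices[0] if self.devices else None

    def move(self, source: str, target: str) -> None:
        # A move is an add of the target followed by a removal of the source.
        self.add(target)
        self.remove(source)


session = AudioSession(queue=["song A", "song B"])
session.add("kitchen")       # "play music" captured in the kitchen
session.add("basement")      # "play the music in the basement too"
session.remove("kitchen")    # "stop music in kitchen"
```

In this sketch the `session` object itself plays the role of the stable association: membership changes mutate it in place rather than creating a replacement group.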

The present disclosure provides an overall understanding of the principles of the structure, function, manufacture, and use of the systems and methods disclosed herein. One or more examples of the present disclosure are illustrated in the accompanying drawings. Those of ordinary skill in the art will understand that the systems and methods specifically described herein and illustrated in the accompanying drawings are non-limiting embodiments. The features illustrated or described in connection with one embodiment may be combined with the features of other embodiments, including as between systems and methods. Such modifications and variations are intended to be included within the scope of the appended claims.

Additional details are described below with reference to several example embodiments.

FIG. 1 illustrates a schematic diagram of an example system 100 for audio output control. The system 100 may include, for example, one or more communal devices 102, such as voice-assistant devices and/or other computing devices, and one or more personal devices 104, such as a mobile device. The communal devices 102 may be associated with an environment, such as a home or place of business. In examples, the communal devices 102 may each be associated with their own location within an environment. By way of example, a first communal device 102 may be situated in one room while a second communal device 102 may be situated in another room. Additionally, or alternatively, the personal device 104 may be associated with the one or more communal devices 102 and/or one or more users residing in the environment. The communal devices 102 may include various computing components, such as one or more processors 106, one or more network interfaces 108, memory 110, one or more microphones 112, one or more speakers 114, and/or one or more displays 116.

In examples, the communal devices 102 may include each of the components described above. In these examples, the communal devices 102 may be configured to capture audio, such as a user utterance, via the microphones 112 and generate corresponding audio data. This audio data may be sent via one or more networks 118 to a remote system 120 and/or a third-party remote system 122 for processing. In other examples, the communal devices 102 may include only a portion of the components described above. For example, in examples where at least one of the communal devices 102 is a communal speaker, the communal device 102 may include the processors 106, the network interfaces 108, memory 110, and/or the speakers 114. In these examples, the communal device 102 may not be configured to capture audio, but instead, the personal device 104, or another communal device 102, may be configured to capture audio and generate corresponding audio data.

The memory 110 of the communal device(s) 102 and/or the personal device 104 may include instructions that, when executed by the one or more processors 106, may cause the one or more processors 106 to perform certain operations. For example, the operations may include sending the audio data representing a user utterance to the remote system 120, such as via the network 118. By way of example, the user utterance may represent a command to control the output of audio via one or more of the communal devices 102. For example, audio may be output via a first communal device 102. The user may desire to alter the output of the audio on the first communal device 102, such as by stopping the audio from being output, and/or the user may desire to output the audio on a second communal device 102 associated with the first communal device 102. For example, as shown in FIG. 1, the user may be located in a first environment 124 that contains the personal device 104 and a communal device 102. Additionally, another communal device 102 may be situated in a second environment 126. The first communal device 102 may be outputting audio, such as a song, via the one or more speakers 114. The user may speak a user utterance associated with controlling the output of the audio. For example, the user may say “Alexa, add the music to the kitchen,” and/or “Alexa, move the music to the kitchen,” and/or “Alexa, stop the music in the kitchen.”

The microphone(s) 112 of the personal device 104 and/or the communal device 102 may capture the user utterance and generate corresponding audio data. The audio data may be sent, via the network 118 and using the network interface(s) 108, to the remote system 120 for processing. The personal device 104 and/or the communal device 102 may receive, from the remote system 120 and via the network 118 and network interface(s) 108, directive data representing a directive for the first communal device 102 and/or the second communal device 102 to perform an action based at least in part on the user utterance. For example, the directive may be for the audio being output by the first communal device 102 to be output by the second communal device 102 simultaneously, or near simultaneously, with the audio output by the first communal device 102. An example of this would be that the first communal device 102 is outputting a song. Based at least in part on the user utterance requesting that the song be output by a second communal device 102, the song may also be output by the second communal device 102 such that both communal devices 102 are outputting the same song at the same or nearly the same time. By way of further example, the directive may be for the audio being output by the first communal device 102 to cease being output by the first communal device 102 and to be output instead by the second communal device 102. Sticking with the song example used herein, based at least in part on the user utterance requesting that the song be moved from the first communal device 102 to the second communal device 102, the song may be output by the second communal device 102 and the song may cease being output by the first communal device 102. By way of further example, the directive may be for the audio being output by the first communal device 102 and the second communal device 102 to cease being output by one of the communal devices 102. Based at least in part on the user utterance requesting that the song cease being output by one of multiple communal devices 102, the song may cease being output on the requested communal device 102 while the other communal device 102 may continue outputting the song.

Additionally, or alternatively, data indicating a state of the communal devices 102 may be sent to the personal device 104 and may cause indicators of the states of the devices to be displayed, such as via the display(s) 116, on the personal device 104. For example, an application may reside on, such as in the memory 110, and/or be accessible by the personal device 104. The application may provide for tactile control of the communal devices 102 and/or may provide information about the states of the communal devices 102. For example, the states of the communal devices 102 may include outputting audio and not outputting audio. The data displayed on the personal device 104 may additionally, or alternatively, include information associated with the audio being output by one or more of the communal devices 102. For example, a naming indicator associated with the audio, such as a song name, album name, artist name, and/or other identifying information may be presented on the display 116. Additionally, or alternatively, naming indicators associated with the communal devices 102 may also be displayed. The naming indicators may be provided by a user, such as during setup of the application and/or the communal devices 102. The naming indicators may, for example, provide an indication of the location of the communal devices 102 within an environment. For example, a communal device 102 located in the kitchen of a home may be labeled as and/or identified as the “kitchen” communal device 102.

Additionally, or alternatively, the data indicating the state of the communal devices 102 and/or the information associated with the audio being output by one or more of the communal devices 102 may be utilized by the personal device 104 and/or the communal device 102 to respond to a user query. For example, the user may provide a user utterance representing a request for information about the state of one or more of the communal devices 102 and/or for information about the audio being output. For example, a user may say “Alexa, what song is being played in the kitchen?” The data may indicate that the state of the kitchen device is outputting audio corresponding to a given audio-session queue, and the identity of the current song being output. A text-to-speech component 142 of the remote system 120 may generate audio data representing a response to the user utterance to be output by the speaker(s) 114 of the personal device 104 and/or the communal device 102.

The remote system 120 of the system 100 may include one or more computing components, such as, for example, one or more processors 128, one or more network interfaces 130, and memory 132. The memory 132 of the remote system 120 may include one or more components, such as, for example, a user profile/account component 134, an automatic speech recognition (ASR) component 136, a natural language understanding (NLU) component 138, a media-grouping state controller 140, a text-to-speech (TTS) component 142, one or more application programming interfaces (APIs) 144, a contextual information database 146, and/or an audio-session queue storage/access component 148. Each of these components will be described in detail below.

The user profile/account component 134 may be configured to identify, determine, and/or generate associations between users, user profiles, user accounts, and/or devices. For example, one or more associations between personal devices 104, communal devices 102, environments, networks 118, users, user profiles, and/or user accounts may be identified, determined, and/or generated by the user profile/account component 134. The user profile/account component 134 may additionally store information indicating one or more applications accessible to the personal device 104 and/or the communal devices 102. It should be understood that the personal device 104 may be associated with one or more other personal devices 104, one or more of the communal devices 102, one or more environments, one or more applications stored on and/or accessible by the personal device 104, and/or one or more users. It should also be understood that a user account may be associated with one or more than one user profile. For example, a given personal device 104 may be associated with a user account and/or user profile that is also associated with the communal devices 102 associated with an environment. The personal device 104, the communal device 102, the user profile, and/or the user account may be associated with one or more applications, which may have their own user profiles and/or user accounts, that provide access to audio data, such as songs.

The ASR component 136 may be configured to receive audio data, which may represent human speech such as user utterances, and generate text data corresponding to the audio data. The text data may include words corresponding to the human speech. The NLU component 138 may be configured to determine one or more intents associated with the human speech based at least in part on the text data. The ASR component 136 and the NLU component 138 are described in more detail below with respect to FIG. 12. For purposes of illustration, the ASR component 136 and the NLU component 138 may be utilized to determine one or more intents to control audio output on one or more communal devices 102.

For example, a user may provide a user utterance to add a communal device 102 to an audio output session, to move an audio output session to another communal device 102, and/or to cease output of audio on one of multiple communal devices 102. Audio data corresponding to the user utterance may be received by the remote system 120. The ASR component 136 may process the audio data and generate corresponding text data. The NLU component 138 may utilize the text data to determine intent data representing an intent of the user to, in these examples, add, move, or remove communal devices 102 from an audio session.

The media-grouping state controller 140 may be configured to control the states of communal devices 102. For example, each communal device 102 may be associated with a state. The state of the communal device 102 may be, for example, an audio-output state where the communal device 102 is currently outputting audio and/or an inactive state where the communal device 102 is not currently outputting audio. Additionally, when multiple communal devices 102 are outputting different audio, such as when a first communal device 102 is outputting a first song and a second communal device 102 is outputting a second song, each of the communal devices 102 may be associated with a different audio-output state. The media-grouping state controller 140 may be further configured to identify and/or determine the state of one or more of the communal devices 102. Based at least in part on receiving a user utterance to control audio output on the communal devices 102, the media-grouping state controller 140 may cause control data to be sent to one or more of the communal devices 102 to change the state of those communal devices 102. Data from the third-party remote system 122 may additionally, or alternatively, inform the identification and/or determination of the state of communal devices 102.

For example, a first communal device 102 may be currently outputting audio associated with an audio-session queue. Based at least in part on the first communal device 102 currently outputting audio, the media-grouping state controller 140 may identify and/or determine that the first communal device 102 is associated with a first audio-output state. A second communal device 102 that is associated with the first communal device 102 may not be currently outputting audio. Based at least in part on the second communal device 102 not outputting audio, the media-grouping state controller 140 may identify and/or determine that the second communal device 102 is associated with an inactive state. The media-grouping state controller 140 may also receive data from, for example, the NLU component 138 indicating that the user desires to act with respect to output of the audio. For example, the NLU component 138 may determine that the user utterance corresponds to an intent to output the audio on the second communal device 102 in addition to the first communal device 102, or otherwise to add the second communal device 102 to the audio session. The media-grouping state controller 140 may, based at least in part on information provided by the NLU component 138, cause the inactive state of the second communal device 102 to change to the audio-output state of the first communal device 102. In this example, actions taken by the first communal device 102 may also be taken by the second communal device 102, such as, for example, outputting the audio, accessing audio-session queues, and/or controlling audio output volumes.

By way of further example, the NLU component 138 may determine that the user utterance corresponds to an intent to output the audio on the second communal device 102 instead of the first communal device 102, or otherwise to move the audio session from the first communal device 102 to the second communal device 102. The media-grouping state controller 140 may, based at least in part on information provided by the NLU component 138, cause the inactive state of the second communal device 102 to change to the audio-output state of the first communal device 102. Additionally, the media-grouping state controller 140 may cause the audio-output state of the first communal device 102 to change to an inactive state, which may be the same state as the second communal device 102 before the user utterance was received, or to a different inactive state. In this example, actions taken by the first communal device 102 may not be taken by the second communal device 102, and/or actions taken by the second communal device 102 may not be taken by the first communal device 102.

By way of further example, the NLU component 138 may determine that the user utterance corresponds to an intent to cease output of the audio on the first communal device 102 but to continue outputting the audio on the second communal device 102, or otherwise to remove the first communal device 102 from the audio session. The media-grouping state controller 140 may, based at least in part on information provided by the NLU component 138, cause the audio-output state of the first communal device 102 to change to an inactive state while maintaining the audio-output state of the second communal device 102. In this example, actions taken by the first communal device 102 may not be taken by the second communal device 102, and/or actions taken by the second communal device 102 may not be taken by the first communal device 102.
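The three state transitions just described (add, move, and remove) can be sketched as follows. This is a hedged illustration only: the class `StateController`, the `INACTIVE` label, and the string format of the audio-output state are assumptions and do not come from the disclosure. In the sketch, devices that hold the same audio-output state label are treated as participating in the same audio session.

```python
INACTIVE = "inactive"  # hypothetical label for a device not outputting audio


class StateController:
    """Tracks one state label per device (names are illustrative only)."""

    def __init__(self) -> None:
        self.states: dict[str, str] = {}

    def set_state(self, device: str, state: str = INACTIVE) -> None:
        self.states[device] = state

    def add(self, source: str, target: str) -> None:
        # Add: the target's inactive state changes to the source's audio-output state.
        self.states[target] = self.states[source]

    def move(self, source: str, target: str) -> None:
        # Move: the target takes the source's state, then the source goes inactive.
        self.states[target] = self.states[source]
        self.states[source] = INACTIVE

    def remove(self, device: str) -> None:
        # Remove: only the named device changes; other devices keep their state.
        self.states[device] = INACTIVE

    def sharing_state_with(self, device: str) -> list[str]:
        # Other devices whose actions mirror this device's actions.
        state = self.states[device]
        if state == INACTIVE:
            return []
        return sorted(d for d, s in self.states.items() if s == state and d != device)


controller = StateController()
controller.set_state("kitchen", "audio-output:queue-1")
controller.set_state("basement")          # starts inactive
controller.add("kitchen", "basement")     # both now share one audio-output state
```

After the `add` call, `controller.sharing_state_with("kitchen")` returns `["basement"]`; a subsequent `controller.remove("kitchen")` leaves the basement device in its audio-output state.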

The media-grouping state controller 140 may also be configured to cause a communal device 102 of multiple associated communal devices 102 to act as a hub device. The hub device may control the other communal devices 102 not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system 120 and/or the third-party remote system 122. Selection of the hub device is described in more detail with respect to FIG. 8, below.

In examples, the media-grouping state controller 140 may be a component of a device management component, which is described in detail with respect to FIG. 13, below.

The TTS component 142 may be configured to generate audio data to be utilized by one or more of the communal devices 102 and/or the personal devices 104 to output audio in the form of synthesized or prerecorded speech. For example, a user may provide an audible query to the personal device 104 and/or the communal device 102. The microphones 112 of the personal device 104 and/or the communal device 102 may capture the user utterance and generate corresponding audio data that is sent to the remote system 120. The ASR component 136 may generate corresponding text data and the NLU component 138 may determine, using the text data, intent data representing an intent by the user to acquire information, such as information associated with the audio being output by the communal device 102. One or more speechlets associated with providing the information may receive the intent data and may determine a response to provide to the user. The TTS component 142 may take text data corresponding to the response and may generate audio data corresponding to the text data. The audio data may be sent to the personal device 104 and/or one or more of the communal devices 102 for output of audio corresponding to the audio data.

By way of example, the user may be near a first communal device 102 located, for example, in a bedroom, and may say “Alexa, what's playing in the kitchen?” Corresponding audio data may be sent to the remote system 120. Text data corresponding to the request may be generated by the ASR component 136 and the NLU component 138 may determine intent data representing the intent of determining identifying information associated with audio being output by the communal device 102 located in and/or associated with the kitchen. Text data representing the identifying information may be generated and/or identified and may be utilized by the TTS component 142 to generate audio data representing a response, which may include the identifying information. The audio data may be sent to the communal device 102 that generated the audio data and the speakers 114 of the communal device 102 may output audio corresponding to the audio data in response to the user utterance.

The APIs 144 may include one or more APIs configured to support communication of data and performance of operations between the personal device 104, the communal devices 102, the remote system 120, and the third-party remote system 122. For example, communication of events associated with control of audio output on the communal devices 102 may be performed via the APIs 144. In situations where the user utterance corresponds to an intent to add a communal device 102 to the audio session, directive data associated with this intent may be sent, via an API 144, to the third-party remote system 122. The directive data may indicate the communal device 102 to be added to the audio session. The third-party remote system 122 may associate the communal device 102 currently outputting audio with the added communal device 102. In situations where the user utterance corresponds to an intent to move the output of audio from one communal device 102 to another communal device 102, directive data associated with this intent may be sent, via an API 144, to the third-party remote system 122. The directive data may indicate the communal device 102 to be added to the audio session and the communal device 102 to be removed from the audio session. The third-party remote system 122 may associate the communal device 102 currently outputting audio with the added communal device 102 and may dissociate the first communal device 102 from the audio session. In situations where the user utterance corresponds to an intent to remove a communal device 102 from an audio session, directive data associated with this intent may be sent, via an API 144, to the third-party remote system 122. The directive data may indicate the communal device 102 to be removed from the audio session. The third-party remote system 122 may dissociate the requested communal device 102 from the audio session.
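As a rough illustration of the directive data described above, the following sketch builds a payload for each of the three intents. The function name `build_directive` and the field names `add` and `remove` are hypothetical; the disclosure does not specify a wire format, so this only shows which devices each directive would need to indicate.

```python
import json
from typing import Optional


def build_directive(intent: str, target: str, source: Optional[str] = None) -> dict:
    """Return hypothetical directive data for an add, move, or remove intent.

    target: device named in the utterance.
    source: device currently outputting audio (needed only for a move).
    """
    if intent == "add":
        # Indicates the device to be added to the audio session.
        return {"add": [target], "remove": []}
    if intent == "move":
        if source is None:
            raise ValueError("a move directive needs the device to be removed")
        # Indicates both the device to be added and the device to be removed.
        return {"add": [target], "remove": [source]}
    if intent == "remove":
        # Indicates the device to be removed from the audio session.
        return {"add": [], "remove": [target]}
    raise ValueError(f"unsupported intent: {intent}")


# "Alexa, move the music to the basement" while the kitchen device is playing.
payload = json.dumps(build_directive("move", target="basement", source="kitchen"))
```

Expressing a move as one payload carrying both lists, rather than as two separate calls, is a design choice of this sketch; it mirrors the description of a single directive that indicates both devices.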

The APIs 144 may be provided by the third-party remote system 122 and/or the APIs 144 may be identified, determined, and/or generated by the remote system 120. The APIs 144 may be utilized for multiple third parties providing access to audio files, and/or the APIs 144 may be specific to the third party providing access to the audio files, and/or the third party that manufactures one or more of the communal devices 102, and/or the third party that develops the audio-output application stored on and/or accessed by the personal device 104.

The contextual information database 146 may be configured to identify, determine, and/or generate contextual information associated with the user profiles, the user accounts, the personal device 104, the communal devices 102, and/or audio data representing user utterances to control audio output. For example, the contextual information may include information about which device captured audio corresponding to user utterances, which communal devices 102 are currently outputting audio and what audio is currently being output, previous audio-output requests, amounts of time between requests, the time of day a request is made, and/or user-specific behavior associated with requests.

For example, the contextual information about which device captured audio corresponding to a user utterance may be utilized to inform a determination, such as by the NLU component 138, of what audio is associated with an intent to add, move, and/or remove a communal device 102 from an audio session. For example, a user utterance may be “Alexa, play this in the kitchen.” In this example, the word “this” is an anaphora. Based at least in part on contextual information identifying audio being output by a communal device 102 from which the user utterance was captured, the remote system 120 may determine that “this” corresponds to the audio currently being output by the communal device 102.

By way of further example, a user utterance may be “Alexa, play the music here.” In this example, the word “here” is an anaphora. Based at least in part on contextual information identifying the communal device 102 that captured the user utterance, the remote system 120 may determine that “here” corresponds to the communal device 102 that captured the audio and/or that generated the audio data. Based at least in part on this determination, the state of the communal device 102 may be transitioned to the state of at least one other communal device 102 that is outputting “the music,” and the audio session associated with the at least one other communal device 102 may be associated with the communal device 102 that captured the user utterance. Additionally, or alternatively, the communal device 102 that was outputting the audio may be dissociated from the audio session and/or the state of the communal device 102 may be transitioned to an inactive or other state. This example may be utilized when the user utterance signifies an intent to move the audio session to another communal device 102 as opposed to adding that communal device 102 to the audio session. In other examples, a user utterance of “Alexa, play the music here also” may signify an intent to add a communal device 102 instead of moving the audio session from one communal device 102 to another communal device 102.

By way of further example, contextual information indicating a timing of a user utterance with respect to a previous user utterance and/or a time of day of the user utterance may be utilized to disambiguate intents and control audio content on multiple communal devices 102. For example, if an audio session is being output by multiple communal devices 102, a user utterance of “Alexa, stop,” may result in each of the communal devices 102 ceasing output of the audio data. A subsequent user utterance received within a threshold amount of time from the “stop” request may result in the requested action being performed by each of the communal devices 102. In other examples, a user utterance received after a threshold amount of time, or received, for example, the next day, may result in only the communal device 102 that captured the user utterance performing the action. In this way, an anaphora of a user utterance may be interpreted to accurately determine how to control audio output on multiple associated communal devices 102, and one or more intents may be inferred.
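The timing-based disambiguation described above can be sketched as a simple threshold check. This is an illustrative sketch only: the function name `devices_for_follow_up` and the default threshold of 300 seconds are assumptions, as no particular threshold value is specified.

```python
from typing import List, Set


def devices_for_follow_up(capturing_device: str,
                          session_devices: Set[str],
                          seconds_since_stop: float,
                          threshold_seconds: float = 300.0) -> List[str]:
    """Choose which devices perform an action requested after a "stop".

    threshold_seconds is a hypothetical value chosen for illustration.
    """
    if seconds_since_stop <= threshold_seconds:
        # Within the threshold: every device of the earlier session acts.
        return sorted(session_devices)
    # After the threshold (for example, the next day): only the device
    # that captured the new utterance acts.
    return [capturing_device]
```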

The audio-session queue storage/access component 148 may be configured to store and/or access an audio-session queue and/or information associated with an audio-session queue. For example, an audio-session queue may be identified, determined, and/or generated based at least in part on a user's request to output audio. For example, a user request to output songs from the Moana soundtrack may result in the identification, determination, and/or generation of an audio-session queue corresponding to the songs on the Moana soundtrack. This audio-session queue may be associated with the communal device 102 from which the request to output audio was received. When subsequent user utterances are received that represent requests to add communal devices 102, move audio sessions to communal devices 102, and/or remove communal devices 102 from audio sessions, the audio-session queue storage/access component 148 may associate audio with the communal devices 102 to effectuate the intended audio output by the communal devices 102.

As used herein, a processor, such as processor(s) 106 and 128, may include multiple processors and/or a processor having multiple cores. Further, the processors may comprise one or more cores of different types. For example, the processors may include application processor units, graphics processing units, and so forth. In one implementation, the processor may comprise a microcontroller and/or a microprocessor. The processor(s) 106 and 128 may include a graphics processing unit (GPU), a microprocessor, a digital signal processor or other processing units or components known in the art. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), etc. Additionally, each of the processor(s) 106 and 128 may possess its own local memory, which also may store program components, program data, and/or one or more operating systems.

The memory 110 and 132 may include volatile and nonvolatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program components, or other data. Such memory 110 and 132 includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other medium which can be used to store the desired information and which can be accessed by a computing device. The memory 110 and 132 may be implemented as computer-readable storage media (“CRSM”), which may be any available physical media accessible by the processor(s) 106 and 128 to execute instructions stored on the memory 110 and 132. In one basic implementation, CRSM may include random access memory (“RAM”) and Flash memory. In other implementations, CRSM may include, but is not limited to, read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), or any other tangible medium which can be used to store the desired information and which can be accessed by the processor(s).

Further, functional components may be stored in the respective memories, or the same functionality may alternatively be implemented in hardware, firmware, application specific integrated circuits, field programmable gate arrays, or as a system on a chip (SoC). In addition, while not illustrated, each respective memory, such as memory 110 and 132, discussed herein may include at least one operating system (OS) component that is configured to manage hardware resource devices such as the network interface(s), the I/O devices of the respective apparatuses, and so forth, and provide various services to applications or components executing on the processors. Such OS component may implement a variant of the FreeBSD operating system as promulgated by the FreeBSD Project; other UNIX or UNIX-like variants; a variation of the Linux operating system as promulgated by Linus Torvalds; the FireOS operating system from Amazon.com Inc. of Seattle, Washington, USA; the Windows operating system from Microsoft Corporation of Redmond, Washington, USA; LynxOS as promulgated by Lynx Software Technologies, Inc. of San Jose, California; Operating System Embedded (Enea OSE) as promulgated by ENEA AB of Sweden; and so forth.

The network interface(s) 108 and 130 may enable communications between the components and/or devices shown in system 100 and/or with one or more other remote systems, as well as other networked devices. Such network interface(s) 108 and 130 may include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over the network 118.

For instance, each of the network interface(s) 108 and 130 may include a personal area network (PAN) component to enable communications over one or more short-range wireless communication channels. For instance, the PAN component may enable communications compliant with at least one of the following standards: IEEE 802.15.4 (ZigBee), IEEE 802.15.1 (Bluetooth), IEEE 802.11 (WiFi), or any other PAN communication protocol. Furthermore, each of the network interface(s) 108 and 130 may include a wide area network (WAN) component to enable communication over a wide area network.

In some instances, the remote system 120 may be local to an environment associated with the personal device 104 and/or one or more of the communal devices 102. For instance, the remote system 120 may be located within the personal device 104 and/or one or more of the communal devices 102. In some instances, some or all of the functionality of the remote system 120 may be performed by one or more of the personal device 104 and/or one or more of the communal devices 102.

FIG. 2 illustrates a schematic diagram of an example environment 200 for causing an additional device to output audio. FIG. 2 depicts a progression, from top to bottom, of the output of audio via multiple devices in the environment 200. The environment 200 may include a first communal device 202 situated in a first portion 204 of the environment 200 and a second communal device 206 situated in a second portion 208 of the environment 200. In the example of FIG. 2, a user may be situated in the first portion 204 of the environment 200. The user may speak a user utterance. In this example, the user utterance is “Add music here” or “Play music here too.” One or more microphones of the first communal device 202 may capture audio corresponding to the user utterance and may generate corresponding audio data. The audio data may be sent to a remote system for speech processing. For example, the remote system may perform automatic speech recognition and natural language understanding techniques to determine an intent associated with the user utterance. The use of automatic speech recognition and natural language understanding techniques is described in more detail with respect to FIG. 12 below. In the example of FIG. 2, the remote system may determine that the user utterance corresponds to an intent to output audio.

In this example, the user utterance includes the anaphora “here.” Based at least in part on contextual information indicating that the user utterance was captured by the first communal device 202, “here” may be associated with an intent to output the “music” on the first communal device 202. Based at least in part on determining that the user utterance corresponds to an intent to output music being output by an associated communal device, such as the second communal device 206, the remote system may identify a source device, which may be the communal device currently outputting the audio, which, in this example, is the second communal device 206. A state controller of the remote system may change the state of the first communal device 202 to be the same as or similar to the state of the second communal device 206 based at least in part on determining that the user utterance corresponds to the intent to output music on the first communal device 202 along with outputting the music on the second communal device 206. Additionally, the audio-session queue associated with the second communal device 206 may be associated with the first communal device 202. In this way, in response to the user utterance, the first communal device 202 and the second communal device 206 may output the same audio in both the first portion 204 and the second portion 208 of the environment 200. Data indicating the audio-output status of the first communal device 202 and the second communal device 206 may be sent to the communal devices, a personal device associated with the communal devices, and/or a third-party remote system. This data may be utilized to provide a visual and/or audible indication of the audio-output status of the communal devices, such as in response to a query for status information from the user.

FIG. 3 illustrates a schematic diagram of an example environment 300 for moving output of audio from a first device to a second device. FIG. 3 depicts a progression, from top to bottom, of the output of audio via multiple devices in the environment 300. The environment 300 may include a first communal device 302 situated in a first portion 304 of the environment 300 and a second communal device 306 situated in a second portion 308 of the environment 300. In the example of FIG. 3, a user may be situated in the first portion 304 of the environment 300. The user may speak a user utterance. In this example, the user utterance is “Move this to kitchen” or “Play this in kitchen.” One or more microphones of the first communal device 302 may capture audio corresponding to the user utterance and may generate corresponding audio data. The audio data may be sent to a remote system for speech processing. For example, the remote system may perform automatic speech recognition and natural language understanding techniques to determine an intent associated with the user utterance. The use of automatic speech recognition and natural language understanding techniques is described in more detail with respect to FIG. 12 below. In the example of FIG. 3, the remote system may determine that the user utterance corresponds to an intent to output audio.

In this example, the user utterance includes the anaphora “this.” Based at least in part on contextual information indicating at least one of the first communal device 302 or the second communal device 306 is currently outputting audio, “this” may be associated with an intent to output the audio currently being output on the second communal device 306. Additionally, at least a portion of the intent may be inferred from the user utterance. For example, for the user utterance of “play this in kitchen,” an intent associated with outputting audio via the communal device associated with the kitchen and ceasing output on the communal device that captured the user utterance may be determined. However, in instances where the user utterance includes an indication that the intent is to add a communal device instead of moving the audio session to a different communal device, such as when the words “too,” “also,” “add,” and/or “as well” are used, the audio may continue to be output via the communal device that captured the user utterance.
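The distinction between an add intent and a move intent based on the words “too,” “also,” “add,” and “as well” can be sketched as a keyword check. This is a simplified stand-in for illustration only; the natural language understanding described herein is not limited to keyword matching, and the names `ADD_MARKERS` and `classify_intent` are hypothetical.

```python
import re

# Marker words and phrases named in the description as signalling "add".
ADD_MARKERS = ("too", "also", "add", "as well")


def classify_intent(utterance: str) -> str:
    """Return "add" if the utterance contains an add marker, else "move"."""
    words = re.findall(r"[a-z']+", utterance.lower())
    # Pad with spaces so that markers match whole words or phrases only
    # (for example, "to" must not match the marker "too").
    padded = " " + " ".join(words) + " "
    if any(f" {marker} " in padded for marker in ADD_MARKERS):
        return "add"
    return "move"
```

For example, `classify_intent("Play this in kitchen")` yields a move intent, while `classify_intent("Play music here too.")` yields an add intent.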

Based at least in part on determining that the user utterance corresponds to an intent to move output of the audio from the first communal device 302 to the second communal device 306, the remote system may identify a source device, which may be the first communal device 302 in this example. A state controller of the remote system may change the state of the second communal device 306 to be the same as or similar to the state of the first communal device 302 based at least in part on determining that the user utterance corresponds to the intent to output music on the second communal device 306 instead of outputting the music on the first communal device 302. Additionally, the audio-session queue associated with the first communal device 302 may be associated with the second communal device 306, and may be dissociated from the first communal device 302. In this way, in response to the user utterance, the second communal device 306 may output the audio instead of the first communal device 302. Data indicating the audio-output status of the first communal device 302 and the second communal device 306 may be sent to the communal devices, a personal device associated with the communal devices, and/or a third-party remote system. This data may be utilized to provide a visual and/or audible indication of the audio-output status of the communal devices, such as in response to a query for status information from the user.

FIG. 4 illustrates a schematic diagram of an example environment 400 for causing one of multiple devices to cease output of audio. FIG. 4 depicts a progression, from top to bottom, of the output of audio via multiple devices in the environment 400. The environment 400 may include a first communal device 402 situated in a first portion 404 of the environment 400 and a second communal device 406 situated in a second portion 408 of the environment 400. In the example of FIG. 4, a user may be situated in the first portion 404 of the environment 400. The user may speak a user utterance. In this example, the user utterance is “stop,” or “stop music here.” One or more microphones of the first communal device 402 may capture audio corresponding to the user utterance and may generate corresponding audio data. The audio data may be sent to a remote system for speech processing. For example, the remote system may perform automatic speech recognition and natural language understanding techniques to determine an intent associated with the user utterance. The use of automatic speech recognition and natural language understanding techniques is described in more detail with respect to FIG. 12 below. In the example of FIG. 4, the remote system may determine that the user utterance corresponds to an intent to alter the output of audio.

In this example, the user utterance includes the anaphora “here.” Based at least in part on contextual information indicating that the user utterance was captured by the first communal device 402, “here” may be associated with an intent to stop output of audio via the first communal device 402. Additionally, at least a portion of the intent may be inferred from the user utterance. For example, as shown in FIG. 4, both the first communal device 402 and the second communal device 406 are outputting audio. When the user utterance corresponds to an intent to cease output of audio, the user utterance may be further utilized to determine whether to cease output of audio on all communal devices currently outputting audio or just a portion of the communal devices. By way of example, the user utterance of “stop” or “stop music,” may correspond to an intent to cease output of audio on all communal devices currently outputting audio in the environment 400. In other examples, the user utterance of “stop music here” or “stop here” or “stop music in the kitchen,” may correspond to an intent to cease output of audio on a portion of the communal devices while continuing to output audio on other communal devices.
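The scope of a “stop” request, as described above, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function `devices_to_stop`, the `device_names` mapping from spoken names to device identifiers, and the identifiers themselves are hypothetical.

```python
import re
from typing import Dict, Set


def devices_to_stop(utterance: str,
                    capturing_device: str,
                    playing_devices: Set[str],
                    device_names: Dict[str, str]) -> Set[str]:
    """Determine which currently playing devices should cease output."""
    words = re.findall(r"[a-z']+", utterance.lower())
    if "here" in words:
        # The anaphora "here" resolves to the device that captured the utterance.
        return {capturing_device} & set(playing_devices)
    # An explicit name such as "kitchen" limits the request to that device.
    named = {device for name, device in device_names.items() if name in words}
    if named:
        return named & set(playing_devices)
    # A bare "stop" or "stop music" applies to every device outputting audio.
    return set(playing_devices)
```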

Based at least in part on determining that the user utterance corresponds to an intent to cease output of audio on the first communal device 402, the remote system may identify the first communal device 402 and may cause the first communal device 402 to cease output of the audio. A state controller of the remote system may change the state of the first communal device 402 to an inactive state or a state different from that of the second communal device 406. Additionally, the audio-session queue may be dissociated from the first communal device 402. In this way, in response to the user utterance, the second communal device 406 may continue outputting the audio while the first communal device 402 may cease output of the audio. Data indicating the audio-output status of the first communal device 402 and the second communal device 406 may be sent to the communal devices, a personal device associated with the communal devices, and/or a third-party remote system. This data may be utilized to provide a visual and/or audible indication of the audio-output status of the communal devices, such as in response to a query for status information from the user. By way of example, even when the first communal device 402 is not associated with the state of the second communal device 406 and/or an audio-session queue, the first communal device 402 may output a response to a request for status information associated with the second communal device 406.

FIGS. 5-7 illustrate various processes for audio content output control. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example, those described with respect to FIGS. 1-4, 8, 12, and 13, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 5 illustrates a flow diagram of a process 500 for causing an additional device to output audio. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 500.

At block 502, the process 500 may include receiving, from a first device, a user command. In examples, the user command may be a user utterance, and audio data representing the user utterance may be received from the first device. In examples, the first device may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding audio data. That audio data may be sent from the first device to a remote system, for example, and may be received at the remote system. In examples, the audio data may be received via an automatic speech recognition component, such as the automatic speech recognition component 136 described with respect to FIG. 12 below. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment. In other examples, the user command may be an input other than an audible input, such as a touch input and/or an instruction sent from another device, such as a personal device.

At block 504, the process 500 may include determining an intent to output audio via the first device and a second device. For example, automatic speech recognition techniques may be utilized to generate text data corresponding to the audio data. The text data may represent words determined from the audio data. Natural language understanding techniques may be utilized to generate intent data that may represent an intent determined from the text data. In examples, a natural language understanding component of the remote system, such as the natural language understanding component 138 described with respect to FIG. 12 below, may be utilized. In this example, the user utterance may be, for example, “add the music to the kitchen.” In this example, the first device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to an “add” intent, which may represent an intent to output audio on a second device in addition to continuing to output audio on the first device.

At block 506, the process 500 may include determining a source device associated with the audio. For example, an audio-session queue may be associated with a device that is currently outputting audio. As described in this example, the first device may be currently outputting audio corresponding to music. An audio-session queue that indicates a queue of songs to be output by the first device may be associated with the first device based at least in part on the first device currently outputting the audio. In examples, an audio-session queue storage/access component, such as the audio-session queue storage/access component 148 described with respect to FIG. 1, may be utilized to determine the source device. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices.

At block 508, the process 500 may include identifying the audio-session queue from the source device. As described above, the audio-session queue may indicate a queue of songs to be output by the source device. In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album or playlist of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue. In examples, an audio-session queue storage/access component, such as the audio-session queue storage/access component 148 described with respect to FIG. 1, may be utilized to identify the audio-session queue.
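One way a dynamic audio-session queue might be reordered after a user indicates that he or she likes a song is sketched below. The measure of similarity is not specified herein; this sketch assumes, purely for illustration, that each song carries a hypothetical `genre` field and that songs sharing the liked song's genre are similar. The function name `promote_similar` is likewise hypothetical.

```python
from typing import Dict, List


def promote_similar(queue: List[Dict[str, str]],
                    liked_song: Dict[str, str]) -> List[Dict[str, str]]:
    """Move songs similar to the liked song up in the queue.

    Python's sorted() is stable, so songs keep their relative order within
    the "similar" group and within the "dissimilar" group.
    """
    # The key is False (sorts first) for similar songs, True for dissimilar ones.
    return sorted(queue, key=lambda song: song["genre"] != liked_song["genre"])
```

A static queue, such as a fixed album, would simply skip this step and be output in its original order.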

At block 510, the process 500 may include determining one or more target devices. Using the example provided above, the user utterance included “add the music to the kitchen.” In this example, the remote system may determine, along with the intent to add music to a device, that the device to which the music is to be added is associated with the word “kitchen.” The word “kitchen” may correspond to an identifier of a device associated with the environment. For example, during setup of a device, the user may be queried to provide a naming indicator for the device, which may, in this example, be a naming indicator associated with a location within the environment that the device is situated. Additionally, or alternatively, the identifier of the device may be learned over time, such as through analysis of user utterances indicating that the device is located in a given portion of the environment. It should be noted that while location-based identifiers are used herein, they are used by way of illustration only. The identifiers of devices may be any identifier, such as “Device 1,” “1,” or any other word, number, or combination thereof. The devices may each have their own device number or alpha-numeric identifier that may be utilized as the identifier of the device for purposes of sending and receiving data. Using the example provided with respect to FIG. 5, the target device may be determined to be the “kitchen” device.

In addition to the device identifier provided explicitly from the user utterance, one or more inferences may be made as to the target devices. For example, when the intent corresponds to an intent to add the audio-session queue to a second device while continuing to output the audio by a first device, the first device may also be determined to be a target device. By way of further example, the user utterance may include an anaphora, such as the use of the word “here” in the utterance “add the music here.” At least one of the target devices may be determined to be the device that captured the audio representing the user utterance based at least in part on the anaphora of “here.”

At block 512, the process 500 may include matching a state of the target device with the state of the source device. For example, each device may be associated with a state. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller, such as the media-grouping state controller 140 described with respect to FIG. 1, may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, a first device may be currently outputting audio associated with an audio-session queue. Based at least in part on the first device currently outputting audio, the state controller may identify and/or determine that the first device is associated with a first audio-output state. A second device that is associated with the first device may not be currently outputting audio. Based at least in part on the second device not outputting audio, the state controller may identify and/or determine that the second device is associated with an inactive state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to output the audio on the second device in addition to the first device, or otherwise to add the second device to the audio session. The state controller may, based at least in part on information provided by the other components, cause the inactive state of the second device to change to the audio-output state of the first device. In this example, actions taken by the first device may also be taken by the second device, such as, for example, outputting the audio, accessing audio-session queues, and/or controlling audio output volumes.
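The state matching described above can be sketched with a small in-memory state controller. This is a minimal sketch, not the disclosed state controller: the class `StateController`, its method names, and the use of plain strings for device identifiers and states are assumptions made for illustration, and the sending of control data to devices is omitted.

```python
from typing import Dict

INACTIVE = "inactive"


class StateController:
    """Hypothetical tracker of per-device audio-output states."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def state_of(self, device: str) -> str:
        # Devices with no recorded state are treated as inactive.
        return self._states.get(device, INACTIVE)

    def set_state(self, device: str, state: str) -> None:
        self._states[device] = state

    def match_state(self, target: str, source: str) -> None:
        """Transition the target device to the state of the source device."""
        self._states[target] = self.state_of(source)

    def deactivate(self, device: str) -> None:
        """Transition a device to the inactive state."""
        self._states[device] = INACTIVE
```

In this sketch an add intent would call `match_state` for the added device, while a move intent would additionally call `deactivate` for the source device.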

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, below.

At block 514, the process 500 may include dissociating the source device from the audio-session queue. For example, when the user utterance corresponds to an intent to output audio on a first device that is currently outputting audio and on a second device that is not currently outputting the audio, the first device and the second device may be determined to be target devices, as described above with respect to block 510. The audio-session queue may be dissociated from the source device, and then at block 516, the audio-session queue may be associated with the first device and the second device as the determined target devices. In examples, associating and/or dissociating audio-session queues may be performed by an audio-session queue storage/access component, such as the audio-session queue storage/access component 148 described with respect to FIG. 1. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a first device and a second device was successful.
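The dissociate-then-associate sequence of blocks 514 and 516, gated on confirmatory data, might be sketched as follows. The function retarget_queue, the associations mapping, and the confirm callback are hypothetical stand-ins; in particular, the callback merely simulates confirmatory data from the third party and is not an actual third-party interface.

```python
from typing import Callable, Dict, Iterable, Set


def retarget_queue(
    associations: Dict[str, Set[str]],
    queue_id: str,
    targets: Iterable[str],
    confirm: Callable[[str, Set[str]], bool],
) -> bool:
    """Retarget a queue to the given devices once the change is confirmed.

    `associations` maps a queue identifier to the device identifiers
    currently associated with that queue.
    """
    new_targets = set(targets)
    # Wait for confirmatory data before changing any association.
    if not confirm(queue_id, new_targets):
        return False
    # Replacing the set dissociates the queue from the source device(s)
    # and associates it with the determined target devices in one step.
    associations[queue_id] = new_targets
    return True


associations = {"queue-1": {"living-room"}}
added = retarget_queue(
    associations, "queue-1", ["living-room", "kitchen"], lambda q, t: True
)
```

If the confirmation callback reports failure, the existing association is left untouched, so the source device continues to be associated with the queue.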

At block 518, the process 500 may include causing output of audio representing a response to the user utterance. For example, if the second device was successfully added such that the audio-session queue is associated with the first device and the second device, audio may be output indicating that the command provided by the user was successfully carried out. Output of audio may be performed via the one or more speakers 114 of a communal device 102, for example.

FIG. 6 illustrates a flow diagram of a process 600 for moving output of audio from a first device to a second device. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 600.

At block 602, process 600 may include receiving, from a first device, audio data representing a user utterance. In examples, the first device may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding audio data. That audio data may be sent from the first device to a remote system, for example, and may be received at the remote system. In examples, the audio data may be received via an automatic speech recognition component, such as the automatic speech recognition component 136 described with respect to FIG. 12 below. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment.

At block 604, the process 600 may include determining an intent to output audio via a second device and not via the first device. For example, automatic speech recognition techniques may be utilized to generate text data corresponding to the audio data. The text data may represent words determined from the audio data. Natural language understanding techniques may be utilized to generate intent data that may represent an intent determined from the text data. In examples, a natural language understanding component of the remote system, such as the natural language understanding component 138 described with respect to FIG. 12 below, may be utilized. In this example, the user utterance may be, for example, “move the music to the kitchen.” In this example, the first device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to a “move” intent, which may represent an intent to output audio on a second device and to cease outputting audio on the first device.

At block 606, the process 600 may include determining a source device associated with the audio. For example, an audio-session queue may be associated with a device that is currently outputting audio. As described in this example, the first device may be currently outputting audio corresponding to music. An audio-session queue that indicates a queue of songs to be output by the first device may be associated with the first device based at least in part on the first device currently outputting the audio. In examples, an audio-session queue storage/access component, such as the audio-session queue storage/access component 134 described with respect to FIG. 1, may be utilized to determine the source device. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices.

At block 608, the process 600 may include identifying the audio-session queue from the source device. As described above, the audio-session queue may indicate a queue of songs to be output by the source device. In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue. In examples, an audio-session queue storage/access component, such as the audio-session queue storage/access component 134 described with respect to FIG. 1, may be utilized to identify the audio-session queue.
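As one illustration of a dynamic audio-session queue, the sketch below moves songs that are similar to a liked song up in the queue and moves dissimilar songs down. The Song type, the promote_similar function, and the use of a shared genre tag as the measure of similarity are all assumptions made for this example; the description does not specify how similarity is determined.

```python
from typing import List, NamedTuple


class Song(NamedTuple):
    """Hypothetical queue entry; `genre` stands in for any similarity signal."""
    title: str
    genre: str


def promote_similar(queue: List[Song], liked: Song) -> List[Song]:
    """Return the queue with songs similar to the liked song moved up.

    The sort is stable, so songs keep their relative order within the
    similar group and within the dissimilar group.
    """
    return sorted(queue, key=lambda song: song.genre != liked.genre)


queue = [
    Song("A", "jazz"),
    Song("B", "rock"),
    Song("C", "jazz"),
    Song("D", "rock"),
]
reordered = promote_similar(queue, liked=Song("X", "rock"))
titles = [song.title for song in reordered]
```

A static queue, such as an album of fixed songs, would simply skip this reordering step.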

At block 610, the process 600 may include determining one or more target devices. Using the example provided above, the user utterance included “move the music to the kitchen.” In this example, the remote system may determine, along with the intent to move music to a device, that the device to which the music is to be moved is associated with the word “kitchen.” The word “kitchen” may correspond to an identifier of a device associated with the environment. For example, during setup of a device, the user may be queried to provide a naming indicator for the device, which may, in this example, be a naming indicator associated with a location within the environment that the device is situated. Additionally, or alternatively, the identifier of the device may be learned over time, such as through analysis of user utterances indicating that the device is located in a given portion of the environment. It should be noted that while location-based identifiers are used herein, they are used by way of illustration only. The identifiers of devices may be any identifier, such as “Device 1,” “1,” or any other word, number, or combination thereof. The devices may each have their own device number or alpha-numeric identifier that may be utilized as the identifier of the device for purposes of sending and receiving data. Using the example provided with respect to FIG. 5, the target device may be determined to be the “kitchen” device.

In addition to the device identifier provided explicitly from the user utterance, one or more inferences may be made as to the target devices. For example, when the user utterance corresponds to an intent to move the audio-session queue to a second device, if the second device is the only other device associated with the first device, it may be inferred that the target device is the second device.
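Target determination, including the inference for an unnamed target, can be sketched as below. The function resolve_target, its parameters, and the device identifiers are hypothetical; the sketch assumes naming indicators are stored as a simple mapping from a spoken name to a device identifier.

```python
from typing import Dict, List, Optional


def resolve_target(
    named: Optional[str],
    naming_indicators: Dict[str, str],
    requesting_device: str,
    associated_devices: List[str],
) -> Optional[str]:
    """Resolve the target device for a "move" intent.

    An explicitly named device is looked up by its naming indicator.
    With no name, a target is inferred only when exactly one other
    device is associated with the requesting device.
    """
    if named is not None:
        return naming_indicators.get(named.lower())
    others = [d for d in associated_devices if d != requesting_device]
    return others[0] if len(others) == 1 else None


indicators = {"kitchen": "device-2", "living room": "device-1"}

# "Move the music to the kitchen": explicit naming indicator.
explicit = resolve_target("Kitchen", indicators, "device-1", ["device-1", "device-2"])

# "Move the music": only one other associated device, so it is inferred.
inferred = resolve_target(None, indicators, "device-1", ["device-1", "device-2"])
```

When more than one other device is associated with the requesting device and no name is given, the sketch returns None rather than guessing.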

At block 612, the process 600 may include associating a state of the target device with a state of the source device. For example, each device may be associated with a state. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller, such as the media-grouping state controller 140 described with respect to FIG. 1, may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, a first device may be currently outputting audio associated with an audio-session queue. Based at least in part on the first device currently outputting audio, the state controller may identify and/or determine that the first device is associated with a first audio-output state. A second device that is associated with the first device may not be currently outputting audio. Based at least in part on the second device not outputting audio, the state controller may identify and/or determine that the second device is associated with an inactive state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to output the audio on the second device and cease outputting audio on the first device. The state controller may, based at least in part on information provided by the other components, cause the inactive state of the second device to change to the audio-output state of the first device. The state controller may also cause the audio-output state of the first device to change to an inactive state. In this example, the second device may output the audio while the first device may cease outputting the audio.

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, below.

At block 614, the process 600 may include moving the audio-session queue from being associated with the source device to being associated with the target device. For example, when the user utterance corresponds to an intent to output audio on a second device that is not currently outputting the audio and to cease outputting audio on the first device that is currently outputting the audio, the second device may be determined to be the target device, as described above with respect to block 610. The audio-session queue may be dissociated from the source device and the audio-session queue may be associated with the second device as the determined target device. In examples, associating and/or dissociating audio-session queues may be performed by an audio-session queue storage/access component, such as the audio-session queue storage/access component 148 described with respect to FIG. 1. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a second device was successful.
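Blocks 612 and 614 together amount to swapping device states and handing the queue from the source device to the target device. The sketch below shows one way this could look; move_session, the states mapping, and the queue_devices mapping are hypothetical names, and confirmation from the third party is omitted for brevity.

```python
from typing import Dict, Set

INACTIVE = "inactive"  # hypothetical label for the inactive state


def move_session(
    states: Dict[str, str],
    queue_devices: Dict[str, Set[str]],
    queue_id: str,
    source: str,
    target: str,
) -> None:
    """Move an audio session from the source device to the target device."""
    # Block 612: the target takes the source's audio-output state and
    # the source becomes inactive.
    states[target] = states[source]
    states[source] = INACTIVE
    # Block 614: dissociate the queue from the source device and
    # associate it with the target device.
    devices = queue_devices[queue_id]
    devices.discard(source)
    devices.add(target)


states = {"living-room": "audio-output:queue-1", "kitchen": INACTIVE}
queue_devices = {"queue-1": {"living-room"}}

# "Move the music to the kitchen."
move_session(states, queue_devices, "queue-1", "living-room", "kitchen")
```

After the call, only the kitchen device is associated with the queue, and the living-room device is inactive.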

At block 616, the process 600 may include causing output of audio representing a response to the user utterance. For example, if the audio-session queue was successfully associated with the second device and successfully dissociated from the first device, audio may be output indicating that the command provided by the user was successfully carried out. Output of audio may be performed via the one or more speakers 114 of a communal device 102, for example.

FIG. 7 illustrates a flow diagram of a process 700 for causing one or multiple devices to cease output of audio. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 700.

At block 702, process 700 may include receiving audio data representing a user utterance. In examples, the audio data may be received from a first device, which may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding audio data. That audio data may be sent from the first device to a remote system, for example, and may be received at the remote system. In examples, the audio data may be received via an automatic speech recognition component, such as the automatic speech recognition component 136 described with respect to FIG. 12 below. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment.

At block 704, the process 700 may include determining an intent to cease output of audio on a device currently outputting audio. For example, the device currently outputting audio may be the first device from which the audio data was received. In other examples, the device currently outputting audio may be another device associated with the first device. To determine an intent, for example, automatic speech recognition techniques may be utilized to generate text data corresponding to the audio data. The text data may represent words determined from the audio data. Natural language understanding techniques may be utilized to generate intent data that may represent an intent determined from the text data. In examples, a natural language understanding component of the remote system, such as the natural language understanding component 138 described with respect to FIG. 12 below, may be utilized. In this example, the user utterance may be, for example, “stop the music in the kitchen.” In this example, the first device and a second device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to a “remove” intent, which may represent an intent to cease output of audio on the second device and to continue outputting audio on the first device.

At block 706, the process 700 may include identifying an audio-session queue associated with the device. An audio-session queue that indicates a queue of songs to be output by the device may be associated with the device based at least in part on the device currently outputting the audio. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices. In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue. In examples, an audio-session queue storage/access component, such as the audio-session queue storage/access component 134 described with respect to FIG. 1, may be utilized to identify the audio-session queue.

At block 708, the process 700 may include determining whether other devices are outputting the audio associated with the audio-session queue. If not, the process 700 may continue to block 716 where the audio-session queue may be dissociated from the device. In examples, associating and/or dissociating audio-session queues may be performed by an audio-session queue storage/access component, such as the audio-session queue storage/access component 148 described with respect to FIG. 1. At block 718, the process 700 may include causing output of audio representing a response to the user utterance. For example, when output of the audio ceases on the device, the audio may provide confirmation that the user's utterance has been successfully acted upon. Output of audio may be performed via the one or more speakers 114 of a communal device 102, for example.

Returning to block 708, if other devices are outputting the audio, then the process 700 may continue to block 710, where the state of the device may be dissociated from the state of the other devices that are outputting the audio. For example, each device may be associated with a state. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, a first device and a second device may be currently outputting audio associated with an audio-session queue. Based at least in part on the first device and the second device currently outputting the audio, the state controller may identify and/or determine that the first device and the second device are associated with an audio-output state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to cease outputting the audio on the first device and continue outputting the audio on the second device. The state controller may, based at least in part on information provided by the other components, cause the audio-output state of the first device to change to the inactive state. In this example, the second device may output the audio while the first device may cease outputting the audio.

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, below.

At block 712, the process 700 may include dissociating the audio-session queue from the device. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device and a second device to just a second device was successful.

Returning to block 708, if other devices are not outputting the audio associated with the audio-session queue, the process 700 may continue to block 714 where the audio-session queue may be dissociated from the device currently outputting the audio. At block 716, the process 700 may include causing output of a response to the user utterance confirming that the command has been processed.
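The branch at block 708 can be summarized in a short sketch. The function remove_device and its arguments are hypothetical, and marking the removed device inactive in both branches is an assumption of this sketch; the description only details the state change for the case where other devices continue outputting the audio.

```python
from typing import Dict, Set

INACTIVE = "inactive"  # hypothetical label for the inactive state


def remove_device(
    session_devices: Set[str], states: Dict[str, str], device: str
) -> Set[str]:
    """Stop output on one device and return the devices still in the session.

    An empty result means no other device was outputting the audio, so
    the queue is no longer associated with any device.
    """
    remaining = session_devices - {device}
    # Assumption: the removed device becomes inactive whether or not
    # other devices continue outputting the audio.
    states[device] = INACTIVE
    # The queue is dissociated from the removed device in both branches;
    # the other devices, if any, keep their shared audio-output state.
    return remaining


states = {"living-room": "audio-output:queue-1", "kitchen": "audio-output:queue-1"}

# "Stop the music in the kitchen" while the living room keeps playing.
still_playing = remove_device({"living-room", "kitchen"}, states, "kitchen")
```

Calling the function again for the last remaining device returns an empty set, which corresponds to the branch where no other devices are outputting the audio.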

FIG. 8 illustrates a schematic diagram of an example environment for selecting one of multiple devices as a hub device. As illustrated, devices, also described as communal devices, include one or more processors 802(1), 802(2), and 802(3). As noted above, in some instances each communal device 108(1)-(3) may include a single radio unit to communicate over multiple protocols (e.g., Bluetooth and BLE), two or more radio units to communicate over two or more protocols, or the like. As used herein, a “radio” and “radio component” may be used interchangeably. Again, in some instances, the devices include any other number of radios, including instances where the devices comprise a single radio configured to communicate over two or more different protocols.

In addition to the above, the devices 108(1)-(3) may include respective memory (or “computer-readable media”) 810(1), 810(2), and 810(3), which may store respective instances of a hub-selection component 812(1), 812(2), and 812(3). The hub-selection components 812(1)-(3) may generate messages (e.g., audio-session queue messages, communication-strength messages, etc.) and one or more maps (e.g., audio-session queue maps, communication-strength maps, etc.), and may be used to select/determine the communication hub. Further, the hub-selection components 812(1)-(3) may send and/or receive the hub-selection messages and store an indication of the selected hub and the amount of time for which the selected device is to act as the hub. The hub-selection components 812(1)-(3) may also set a timer for determining the amount of time for which the selected device is to act as a hub, or may otherwise determine when the time for the device to act as the hub has elapsed.

In some instances, messages sent by each device indicate a current state of the device and whether the device is associated with an audio-session queue (also referred to as a “state value”), a current connection strength to the WLAN of the device, information identifying the WLAN, information identifying the device, and/or the like. With this information, each hub-selection component 812(1)-(3) may determine the device that is to be selected as the communication hub. In some instances, the hub-selection components 812(1)-(3) may implement an algorithm that selects the device that is associated with an audio-session queue and/or the device that was first associated with a given audio-session queue as the communication hub. In other instances, the components 812(1)-(3) may select the device having the highest connection strength as the communication hub. In still other instances, each component is configured to implement a cost function that selects the communication hub based on one or more weighted factors, such as current association with audio-session queues, connection strengths, and so forth. In other examples, one of the devices may be designated by the user as the hub and/or one of the devices may include additional components and/or functionality and may be designated as the hub based at least in part on those additional components and/or functionality.
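A weighted cost function of the kind mentioned above might look like the following sketch. The Candidate type, the select_hub function, the weight values, and the normalization of connection strength to the range 0 to 1 are all assumptions for illustration; the description names the factors but does not specify the weights or the scoring formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical summary of the information a device reports in its messages."""
    device_id: str
    has_audio_session_queue: bool
    connection_strength: float  # assumed to be normalized to 0.0-1.0


def select_hub(
    candidates: List[Candidate],
    queue_weight: float = 1.0,
    strength_weight: float = 0.5,
) -> str:
    """Return the identifier of the candidate with the highest weighted score."""

    def score(c: Candidate) -> float:
        return (
            queue_weight * float(c.has_audio_session_queue)
            + strength_weight * c.connection_strength
        )

    # The device identifier breaks ties, so every hub-selection component
    # that sees the same messages arrives at the same hub.
    best = max(candidates, key=lambda c: (score(c), c.device_id))
    return best.device_id


candidates = [
    Candidate("108(1)", has_audio_session_queue=True, connection_strength=0.4),
    Candidate("108(2)", has_audio_session_queue=False, connection_strength=0.9),
    Candidate("108(3)", has_audio_session_queue=False, connection_strength=0.7),
]
hub = select_hub(candidates)
```

Setting queue_weight to zero reduces the sketch to selecting the device with the highest connection strength, which is another of the selection strategies described above.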

The communal devices 108(1)-(3) and a primary device may couple with one another over a short-range wireless network, thus collectively forming a piconet 108. In the illustrated example, each of the devices is configured to communicate both with one another over a short-range connection as well as over a network 118. In other instances, while some of the communal devices 108(1)-(3) may be configured to communicate over a short-range wireless network and over the network 118, the other communal devices 108(1)-(3) may be configured to communicate over multiple short-range wireless protocols (e.g., Bluetooth, BLE, etc.) while being incapable of communicating over the network 118. In these instances, the communal devices 108(1)-(3) may select a communication hub that communicates with the other communal devices over a low-power protocol while communicating with the hub device over a higher-power protocol. The hub device may then communicate these messages over the network 118.

Additionally, one or more hub-selection messages may be sent between communal devices in response to determining that a device is to act as the communication hub. For instance, one or more of the non-hub devices may send a message and/or a remote system may send a message. As illustrated, the hub-selection message may indicate the device identification (DID) of the selected communication hub, in this example, the DID of the first communal device 108(1), as well as the amount of time for which the selected accessory device is to act as the communication hub. In examples, this amount of time may be preconfigured and constant, while in other instances it may vary depending on associations between the devices and an audio-session queue, the number of devices in the piconet, or the like. In response to receiving the hub-selection message, the non-hub devices may store an indication of the DID of the communication hub as well as the amount of time for which the selected accessory device is to act as the communication hub. The devices may then again send out messages after expiration of the amount of time or just prior to expiration of this amount of time to determine whether the communication hub should change.
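How a non-hub device might store a received hub-selection message and decide when to send out messages again is sketched below. HubSelectionMessage, HubRecord, the field names, and the margin parameter are hypothetical; times are plain numbers of seconds supplied by the caller so that the sketch does not depend on any particular clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSelectionMessage:
    """Hypothetical message carrying the hub's DID and its term as hub."""
    hub_did: str
    term_seconds: float


class HubRecord:
    """What a non-hub device stores after receiving a hub-selection message."""

    def __init__(self, message: HubSelectionMessage, received_at: float) -> None:
        self.hub_did = message.hub_did
        self.expires_at = received_at + message.term_seconds

    def should_reselect(self, now: float, margin: float = 0.0) -> bool:
        """Return True at expiration, or `margin` seconds before it."""
        return now >= self.expires_at - margin


record = HubRecord(HubSelectionMessage("DID-108-1", term_seconds=600.0), received_at=0.0)
mid_term = record.should_reselect(now=300.0)
at_expiry = record.should_reselect(now=600.0)
just_before = record.should_reselect(now=595.0, margin=10.0)
```

A nonzero margin corresponds to sending out messages just prior to expiration of the amount of time, while a margin of zero corresponds to waiting until the time has elapsed.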

FIGS. 9-11 illustrate various processes for audio content output control. The processes described herein are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation, unless specifically noted. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, architectures and systems described in the examples herein, such as, for example, those described with respect to FIGS. 1-4, 8, 12, and 13, although the processes may be implemented in a wide variety of other environments, architectures and systems.

FIG. 9 illustrates a flow diagram of an example process 900 for content playback control. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 900.

At block 902, process 900 may include receiving, from a first device associated with a wireless network, audio data representing a user utterance. The first device may be operating in a first state indicating that the first device is outputting audio content. In examples, the first device may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding audio data. That audio data may be sent from the first device to a remote system, for example, and may be received at the remote system. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment.

At block 904, the process 900 may include determining, from the audio data, intent data indicating a request to add the audio content to a second device associated with the wireless network while synchronously outputting the audio content by the first device. The second device may be operating in a second state indicating the second device is not outputting audio content. For example, automatic speech recognition techniques may be utilized to generate text data corresponding to the audio data. The text data may represent words determined from the audio data. Natural language understanding techniques may be utilized to generate the intent data that may represent an intent determined from the text data. In this example, the user utterance may be, for example, “add the music to the kitchen.” In this example, the first device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to an “add” intent, which may represent an intent to output audio on the second device, which in this example would be associated with the naming indicator “kitchen,” in addition to continuing to output audio on the first device.

At block 906, the process 900 may include causing, from the intent data, the second device to transition from the second state to the first state. For example, each device may be associated with a state, as described above. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, the first device may be currently outputting audio associated with an audio-session queue. Based at least in part on the first device currently outputting audio, the state controller may identify and/or determine that the first device is associated with a first audio-output state. The second device that is associated with the first device may not be currently outputting audio. Based at least in part on the second device not outputting audio, the state controller may identify and/or determine that the second device is associated with an inactive state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to output the audio on the second device in addition to the first device, or otherwise to add the second device to the audio session. The state controller may, based at least in part on information provided by the other components, cause the inactive state of the second device to change to the audio-output state of the first device. In this example, actions taken by the first device may also be taken by the second device, such as, for example, outputting the audio, accessing audio-session queues, and/or controlling audio output volumes.

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, above. In this example, the first device may be selected as the hub device based at least in part on the first device being the source device before the audio data was received.

At block 908, the process 900 may include identifying, from the first device outputting the audio content, queue data associated with the audio content. The queue data may represent a queue of audio files. For example, the queue data may represent an audio-session queue. An audio-session queue that indicates a queue of songs to be output by the first device may be associated with the first device based at least in part on the first device currently outputting the audio. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices.

In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue.
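By way of a hedged example, a dynamic audio-session queue of the kind described above may be reordered as sketched below. The similarity test used here (a shared genre label) is an assumption made purely for illustration; the description does not specify how similarity between songs is determined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Song:
    title: str
    genre: str  # hypothetical similarity key, assumed for this sketch


def rerank_on_like(queue: List[Song], liked: Song) -> List[Song]:
    """Move songs similar to the liked song ahead of dissimilar songs.

    The relative order inside each group is preserved.
    """
    similar = [s for s in queue if s.genre == liked.genre]
    dissimilar = [s for s in queue if s.genre != liked.genre]
    return similar + dissimilar


queue = [Song("A", "jazz"), Song("B", "rock"), Song("C", "jazz")]
reranked = rerank_on_like(queue, Song("Liked", "jazz"))
```

A static queue, such as an album of fixed songs, would simply skip this reordering step.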

At block 910, the process 900 may include associating, based at least in part on causing the second device to transition to the first state, the queue data with the second device such that a first identifier of the first device is identified as being configured to access the queue of audio files and a second identifier of the second device is identified as being configured to access the queue of audio files. The audio-session queue may also be dissociated from the first device, which may be described as the source device. The audio-session queue may then be associated with the first device and the second device as the determined target devices. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a first device and a second device was successful.

At block 912, the process 900 may include sending a first command to the first device to output the audio content such that the first device and the second device output the audio content synchronously. For example, if the audio data was received during output of audio, such as in the middle of outputting a song, the first device may continue to output audio corresponding to the song without interruption.

At block 914, the process 900 may include sending a second command to the second device to access the queue of audio files and to output the audio content such that the first device and the second device output the audio content synchronously. The second device may output the audio corresponding to a portion of the song that has not been output by the first device. In this way, the first device and the second device may output the same audio, or instances of the same audio, at the same time or at substantially similar times. The second command may be generated and sent from the remote system and/or the second command may be generated and/or sent from the first device and/or another device such as a smart-home hub device.

The process 900 may additionally, or alternatively, include receiving, from a third device associated with the wireless network, second audio data representing a second user utterance. The process 900 may also include determining, from the second audio data, second intent data indicating a request to identify the audio content being output by the first device. The queue data may be determined to be associated with the first device and a portion of the audio content being output by the first device may be identified. The process 900 may also include causing output, via the third device, of audio corresponding to a response to the request. The response may be based at least in part on the portion of the audio content being output. In this way, devices that are outputting the audio content, such as the first device and/or the second device, may be queried to provide information about the audio content being output. Additionally, devices that are not outputting the audio content but that are associated with at least one of the devices that are outputting the audio content may also be queried to provide the information.

The process 900 may additionally, or alternatively, include receiving, from the second device, second audio data representing a second user utterance. The process 900 may also include determining, from the second audio data, second intent data indicating a request to cease output of the audio content. Based at least in part on receiving the second audio data from the second device and the second intent data, the process 900 may include causing the audio content to cease being output by the second device. Additionally, the audio content may be caused to cease being output by the first device based at least in part on the first device and the second device operating in the first state.

The process 900 may additionally, or alternatively, include determining, via automatic speech recognition, text data corresponding to the user utterance. The process 900 may also include determining that the text data includes a word that corresponds to an anaphora. The process 900 may also include determining that the anaphora corresponds to the audio content based on the audio content being output by the first device at the time the audio data was received. In this example, the anaphora may be the word “this,” and based at least in part on the first device outputting the audio content, it may be determined that “this” corresponds to the audio content. Determining the intent data representing the intent to output the audio content on the second device may be based at least in part on determining that the anaphora refers to the audio content.

FIG. 10 illustrates a flow diagram of another example process 1000 for content playback control. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1000.

At block 1002, process 1000 may include receiving input data representing a user utterance made while a first device outputs audio content. The input data may be received via a first device. In examples, the input data may be audio data. In other examples, the input data may be a command from, for example, an application running on a device being used by the user, such as a mobile phone. The first device may output audio content in a first state. In examples, the first device may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding audio data. That audio data may be sent from the first device to a remote system, for example, and may be received at the remote system. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment.

At block 1004, the process 1000 may include determining, from the input data, that the audio content is to be output by a second device in time synchronization with the first device. For example, automatic speech recognition techniques may be utilized to generate text data corresponding to the input data. The text data may represent words determined from the input data. Natural language understanding techniques may be utilized to generate the intent data that may represent an intent determined from the text data. In this example, the user utterance may be, for example, “add the music to the kitchen.” In this example, the first device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to an “add” intent, which may represent an intent to output audio on the second device, which in this example would be associated with the naming indicator “kitchen,” in addition to continuing to output audio on the first device.

At block 1006, the process 1000 may include causing, based at least in part on determining that the audio content is to be output by the first device and the second device in time synchronization, the second device to be associated with the first device such that at least some actions performed by the first device are performed by the second device. For example, each device may be associated with a state, as described above. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, the first device may be currently outputting audio associated with an audio-session queue. Based at least in part on the first device currently outputting audio, the state controller may identify and/or determine that the first device is associated with a first audio-output state. The second device that is associated with the first device may not be currently outputting audio. Based at least in part on the second device not outputting audio, the state controller may identify and/or determine that the second device is associated with an inactive state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to output the audio on the second device in addition to the first device, or otherwise to add the second device to the audio session. The state controller may, based at least in part on information provided by the other components, cause the inactive state of the second device to change to the audio-output state of the first device. In this example, actions taken by the first device may also be taken by the second device, such as, for example, outputting the audio, accessing audio-session queues, and/or controlling audio output volumes.

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, above. In this example, the first device may be selected as the hub device based at least in part on the first device being the source device before the audio data was received.

At block 1008, the process 1000 may include identifying queue data associated with the audio content. The queue data may be associated with the first device based at least in part on the first device outputting the audio content. For example, the queue data may represent an audio-session queue. An audio-session queue that indicates a queue of songs to be output by the first device may be associated with the first device based at least in part on the first device currently outputting the audio. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices.

In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue.

At block 1010, the process 1000 may include associating, based at least in part on causing the second device to be associated with the first device, the queue data with the second device. The audio-session queue may also be dissociated from the first device, which may be described as the source device. The audio-session queue may then be associated with the first device and the second device as the determined target devices. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a first device and a second device was successful.

At block 1012, the process 1000 may include causing the second device to output the audio content in time synchronization with output of the audio content by the first device. For example, if the audio data was received during output of audio, such as in the middle of outputting a song, the first device may continue to output audio corresponding to the song without interruption. Additionally, the second device may output the audio corresponding to a portion of the song that has not been output by the first device. In this way, the first device and the second device may output the same audio, or instances of the same audio, at the same time or at substantially similar times. As used herein, “in time synchronization” means that the first device and the second device output the audio, or instances of the audio, at the same time or at substantially similar times. For example, there may be a 0.1 to 25 millisecond difference and/or delay between output of the audio by the first device as compared to output of the audio by the second device.
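As an illustrative sketch only, the notion of “in time synchronization” may be expressed as a tolerance check on the output timing of the two devices. The 25 millisecond bound below is taken from the example range given above; the function names and the latency-compensation helper are assumptions introduced for this example.

```python
MAX_SKEW_MS = 25.0  # upper end of the example 0.1 to 25 millisecond range


def in_time_sync(first_output_ms: float, second_output_ms: float,
                 max_skew_ms: float = MAX_SKEW_MS) -> bool:
    """Return True when the two outputs differ by no more than the tolerance."""
    return abs(first_output_ms - second_output_ms) <= max_skew_ms


def second_device_start_position_ms(first_position_ms: float,
                                    command_latency_ms: float) -> float:
    """Song position at which the second device may begin output.

    The second device joins at a portion of the song that the first device
    has not yet output, so the expected command latency (hypothetical
    parameter) is added to the first device's current position.
    """
    return first_position_ms + command_latency_ms


start_position = second_device_start_position_ms(90_000.0, 40.0)
```

For instance, outputs that begin 12.5 milliseconds apart fall within the tolerance, while outputs that begin 30 milliseconds apart do not.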

The process 1000 may additionally, or alternatively, include receiving, from a third device located in a third environment, second input data representing a second user utterance. The process 1000 may also include determining, from the second input data, intent data indicating a request to identify the audio content being output by the first device. The queue data may be determined to be associated with the first device and a portion of the audio content being output by the first device may be identified. The process 1000 may also include causing output, via the third device, of audio corresponding to a response to the request. The response may be based at least in part on the portion of the audio content being output. In this way, devices that are outputting the audio content, such as the first device and/or the second device, may be queried to provide information about the audio content being output. Additionally, devices that are not outputting the audio content but that are associated with at least one of the devices that are outputting the audio content may also be queried to provide the information.

The process 1000 may additionally, or alternatively, include receiving, from the second device, second input data representing a second user utterance. The process 1000 may also include determining, from the second input data, intent data indicating a request to alter output of the audio content. Based at least in part on receiving the second input data from the second device and the intent data, the process 1000 may include causing output of the audio content to be altered. For example, the process 1000 may include generating, based at least in part on the intent data, directive data indicating that the audio content output by the second device is to be altered. Additionally, the audio content may be altered via the first device based at least in part on the first device and the second device operating in the first state.

The process 1000 may additionally, or alternatively, include determining that the user utterance includes an anaphora and determining that the anaphora corresponds to the audio content based on the audio content being output by the first device at the time the input data was received. In this example, the anaphora may be the word “this,” and based at least in part on the first device outputting the audio content, it may be determined that “this” corresponds to the audio content. Determining that the audio content is to be output by the first device and the second device may be based at least in part on determining that the anaphora refers to the audio content.

The process 1000 may additionally, or alternatively, include determining that the user utterance includes an anaphora and determining that the anaphora corresponds to an identification of the first device based at least in part on the input data being received via the first device. In this example, the anaphora may be the word “here,” and based at least in part on receiving the input data from the first device, it may be determined that “here” corresponds to the first device. Determining that the audio content is to be output by the first device and the second device may be based at least in part on determining that the anaphora refers to the first device.
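The two anaphora cases described above may be sketched, under simplifying assumptions, as a lookup keyed on the device that received the input data. The function name, the whitespace tokenization, and the now-playing mapping are hypothetical; an actual system would rely on natural language understanding components rather than word matching.

```python
from typing import Dict, Tuple


def resolve_anaphora(text: str, receiving_device_id: str,
                     now_playing: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Resolve "this" to audio content and "here" to the receiving device."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    resolved: Dict[str, Tuple[str, str]] = {}
    # "this" refers to the content the receiving device was outputting
    # when the input data was received.
    if "this" in words and receiving_device_id in now_playing:
        resolved["this"] = ("content", now_playing[receiving_device_id])
    # "here" refers to the device via which the input data was received.
    if "here" in words:
        resolved["here"] = ("device", receiving_device_id)
    return resolved


now_playing = {"living-room": "song-1"}
this_result = resolve_anaphora("Play this in the kitchen.", "living-room", now_playing)
here_result = resolve_anaphora("Play the music here too", "living-room", now_playing)
```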

The process 1000 may additionally, or alternatively, include determining that an amount of time has passed since the queue data was associated with the second device and determining that the amount of time is more than a threshold amount of time. The process 1000 may also include causing the second device to be dissociated from the first device based at least in part on determining that the amount of time is more than the threshold amount of time. Dissociating devices may also be based at least in part on a determination that the association of the devices occurred on a previous day. The states of the devices may also be dissociated and the audio-session queue may be dissociated from one or all of the previously-associated devices.
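A minimal sketch of the time-based dissociation described above follows. The eight-hour default threshold is a hypothetical value chosen for the example; the description does not state a specific threshold amount of time.

```python
from datetime import datetime, timedelta


def should_dissociate(associated_at: datetime, now: datetime,
                      threshold: timedelta = timedelta(hours=8)) -> bool:
    """Return True when the devices may be dissociated.

    Dissociation is indicated when more than the threshold amount of time
    has passed, or when the association occurred on a previous day.
    """
    elapsed = now - associated_at
    previous_day = associated_at.date() < now.date()
    return elapsed > threshold or previous_day


same_day_recent = should_dissociate(datetime(2024, 1, 1, 9, 0),
                                    datetime(2024, 1, 1, 10, 0))
same_day_stale = should_dissociate(datetime(2024, 1, 1, 9, 0),
                                   datetime(2024, 1, 1, 18, 0))
across_midnight = should_dissociate(datetime(2024, 1, 1, 23, 30),
                                    datetime(2024, 1, 2, 0, 30))
```

When the check returns true, the device states and the audio-session queue association would be cleared for one or all of the previously-associated devices.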

The process 1000 may also include receiving, via the first device, second input data representing a second user utterance and determining intent data indicating a request to output second audio content. The process 1000 may also include determining, based at least in part on the intent data, that the second audio content is to be output via the first device without altering output of the first audio content via the second device. The process 1000 may also include causing the second device to be dissociated from the first device based at least in part on determining that the second audio content is to be output via the first device without altering output of the first audio content via the second device.

The process 1000 may also include receiving, via the first device, second input data representing a second user utterance and determining, based at least in part on the second input data, intent data indicating a request to output second audio content. The process 1000 may also include determining that the second device is outputting the first audio content and causing the first device to output audio representing a request to authorize the second audio content to be output via the second device. Third input data representing a response to the request may be received via the first device and the process 1000 may include causing the second device to output the second audio content based at least in part on the third input data indicating authorization.

The process 1000 may additionally, or alternatively, include receiving, via the second device, second input data representing a second user utterance, wherein receiving the second input data may correspond to an event associated with the second device. The process 1000 may also include determining, based at least in part on the second input data, intent data indicating a request to alter output of the audio content and generating, based at least in part on the intent data, directive data. The directive data may indicate that the audio content output by the second device is to be altered. The process 1000 may also include sending, to the second device, the directive data and causing the audio content to be altered on the first device and the second device based at least in part on sending the directive data to the second device. In this way, an event that alters output of audio content on one device of multiple devices operating in the same or a similar state and/or that are associated with an audio-session queue may result in data corresponding to one event being sent to the remote system for processing and may result in directive data being sent to just one of the devices. From there, the other associated devices may be caused to alter the audio content output based at least in part on the associated state and/or associated audio-session queue.
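As a hedged illustration of the behavior described above, a directive sent to one device may be propagated to the devices associated with it. The grouping structure and the use of a volume value as the altered property are assumptions made for this sketch.

```python
from typing import Dict, List, Set


def apply_directive(target_id: str, volume: int,
                    groups: List[Set[str]],
                    volumes: Dict[str, int]) -> List[str]:
    """Apply a directive addressed to one device across its associated group.

    Only one directive is addressed (to target_id); devices sharing the
    target's state and audio-session queue are altered as a consequence.
    """
    group = next((g for g in groups if target_id in g), {target_id})
    for device_id in group:
        volumes[device_id] = volume
    return sorted(group)


groups = [{"living-room", "kitchen"}]  # devices operating in the same state
volumes = {"living-room": 5, "kitchen": 5, "bedroom": 5}
affected = apply_directive("kitchen", 3, groups, volumes)
```

In this sketch the unassociated device is left unchanged, which mirrors the point that only devices sharing the state and/or audio-session queue follow the directive.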

The process 1000 may additionally, or alternatively, include sending, to a remote system, state data indicating that the first device and the second device are operating in a first state and sending, to the remote system, queue-association data indicating that the queue data is associated with the first device and the second device. The process 1000 may also include receiving, from the remote system, request data indicating a request to alter output of the audio content and sending, to at least one of the first device or the second device, directive data representing a directive to alter output of the audio content on the first device and the second device based at least in part on the state data and the queue-association data. In this way, data indicating which devices are operating in the same or a similar state and which devices are associated with an audio-session queue may be communicated with third parties that, for example, provide one or more of the devices on which audio content is output, provide one or more applications for controlling audio content output, and/or provide means for accessing and/or generating audio-session queues.

FIG. 11 illustrates a flow diagram of an example process 1100 for content playback control. The order in which the operations or steps are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement process 1100.

At block 1102, process 1100 may include receiving input data representing a user utterance. The input data may be received from a first device. In examples, the input data may be audio data. In other examples, the input data may be a command from, for example, an application running on a device being used by the user, such as a mobile phone. The first device may output audio content and may be operating in a first state. In examples, the first device may be a communal device, such as the communal devices 102 described above with respect to FIG. 1. One or more microphones of the first device may capture audio representing the user utterance and may generate corresponding input data. That input data may be sent from the first device to a remote system, for example, and may be received at the remote system. The first device may be situated in a first portion of an environment and may be associated with one or more other devices situated in other portions of the environment.

At block 1104, the process 1100 may include determining, from the input data, that audio content is to be output by a first device instead of a second device currently outputting the audio content. The first device may be associated with a first state. The second device may be associated with a second state. For example, automatic speech recognition techniques may be utilized to generate text data corresponding to the input data. The text data may represent words determined from the input data. Natural language understanding techniques may be utilized to generate the intent data that may represent an intent determined from the text data. In this example, the user utterance may be, for example, “move the music to the kitchen.” In this example, the second device may be outputting audio corresponding to music. Based at least in part on the intent data, it may be determined that the user utterance corresponds to a “move” intent, which may represent an intent to output audio on the first device, which in this example would be associated with the naming indicator “kitchen,” and to cease output of the audio content by the second device.

At block 1106, the process 1100 may include causing, based at least in part on determining that the audio content is to be output by the first device instead of the second device, the first device to be associated with a state of the second device. For example, each device may be associated with a state, as described above. The state of a device may be, for example, an audio-output state where the device is currently outputting audio and/or an inactive state where the device is not currently outputting audio. Additionally, when multiple devices are outputting different audio, such as when a first device is outputting a first song and a second device is outputting a second song, each of the devices may be associated with a different audio-output state. A state controller may be configured to identify and/or determine the state of one or more of the devices. Based at least in part on receiving a user utterance to control audio output on the devices, the state controller may cause control data to be sent to one or more of the devices to change the state of those devices.

For example, the second device may be currently outputting audio associated with an audio-session queue. Based at least in part on the second device currently outputting audio, the state controller may identify and/or determine that the second device is associated with a first audio-output state. The first device that is associated with the second device may not be currently outputting audio. Based at least in part on the first device not outputting audio, the state controller may identify and/or determine that the first device is associated with an inactive state. The state controller may also receive data from, for example, other components of the remote system indicating that the user desires to act with respect to output of the audio. For example, it may be determined that the user utterance corresponds to an intent to output the audio on the first device instead of on the second device, or otherwise to move the audio session from the second device to the first device. The state controller may, based at least in part on information provided by the other components, cause the inactive state of the first device to change to the audio-output state of the second device. The state controller may also cause the audio-output state of the second device to change to an inactive state.

The state controller may also be configured to cause a device of multiple associated devices to act as a hub device. The hub device may control the other devices not designated as a hub device. In these examples, data may flow from the non-hub devices to the hub device, which may communicate on behalf of the hub device and the non-hub devices with the remote system and/or a third-party remote system. Selection of the hub device is described in more detail with respect to FIG. 8, above. In this example, the first device may be selected as the hub device based at least in part on the first device being the source device before the input data was received. Alternatively, the second device may be selected as the hub device based at least in part on the audio session being moved to the second device.

At block 1108, the process 1100 may include identifying queue data associated with the audio content being output by the second device. The queue data may be associated with the second device based at least in part on the second device outputting the audio content. For example, the queue data may represent an audio-session queue. An audio-session queue that indicates a queue of songs to be output by the second device may be associated with the second device based at least in part on the second device currently outputting the audio. It should be understood that while one device is determined to be the source device in this example, multiple devices may be determined to be source devices based at least in part on audio currently being output by the devices.

In some examples, the audio-session queue is static, such as in situations where output of the queued songs is from an album of fixed songs. In other examples, the audio-session queue may be dynamic and may change based at least in part on how a user interacts with the audio being output. For example, a user's indication that he or she likes the song being output may cause the audio-session queue to change such that similar songs to the liked song are added to the queue, or moved up in the queue, while dissimilar songs are removed from the queue, or moved down in the queue.

At block 1110, the process 1100 may include causing the queue data to be dissociated from the second device. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a second device was successful.

At block 1112, the process 1100 may include associating, based at least in part on causing the first device to be associated with the state of the second device, the queue data with the first device. The audio-session queue may then be associated with the first device as the determined target device. In examples, associating and/or dissociating audio-session queues may be performed after confirmatory data has been received from the third party associated with the audio-session queue. The confirmatory data may indicate that the intended retargeting of the audio-session queue from a first device to a second device was successful.

At block 1114, the process 1100 may include causing the second device tocease outputting audio content.

At block 1116, the process 1100 may include causing the first device to output the audio content. For example, if the input data was received during output of audio, such as in the middle of outputting a song, the first device may output the audio corresponding to a portion of the song that has not been output by the second device. In this way, the first device may output the same audio, or an instance of the same audio, at the same time or at substantially similar times as the second device would have output the audio if the second device were not removed from the audio-session queue.
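Blocks 1106 through 1116 may be summarized, as a non-limiting sketch, by the following routine. The dictionary-based bookkeeping, the state labels, and the boolean confirmation flag are hypothetical stand-ins for the state controller, the queue association, and the confirmatory data received from the third party.

```python
from typing import Dict, Set

INACTIVE = "inactive"  # hypothetical label for the inactive state


def move_session(queue_devices: Dict[str, Set[str]], states: Dict[str, str],
                 queue_id: str, source_id: str, target_id: str,
                 confirmed: bool) -> bool:
    """Retarget an audio-session queue from a source device to a target device."""
    if not confirmed:
        # Without confirmatory data, leave associations and states untouched.
        return False
    queue_devices[queue_id].discard(source_id)  # dissociate the source device
    queue_devices[queue_id].add(target_id)      # associate the target device
    states[target_id] = states[source_id]       # target takes the audio-output state
    states[source_id] = INACTIVE                # source ceases output
    return True


queue_devices = {"queue-1": {"living-room"}}
states = {"living-room": "audio-output-1", "kitchen": INACTIVE}
moved = move_session(queue_devices, states, "queue-1",
                     "living-room", "kitchen", confirmed=True)
```

Here "living-room" plays the role of the device currently outputting the audio and "kitchen" plays the role of the device named in a “move the music to the kitchen” utterance.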

The process 1100 may additionally, or alternatively, include receiving, from a third device, second input data representing a second user utterance. The process 1100 may also include determining, from the second input data, intent data indicating a request to identify the audio content being output by the first device. The queue data may be determined to be associated with the first device and a portion of the audio content being output by the first device may be identified. The process 1100 may also include causing output, via the third device, of audio corresponding to a response to the request. The response may be based at least in part on the portion of the audio content being output. In this way, devices that are outputting the audio content, such as the first device and/or the second device, may be queried to provide information about the audio content being output. Additionally, devices that are not outputting the audio content but that are associated with at least one of the devices that are outputting the audio content may also be queried to provide the information.

The process 1100 may additionally, or alternatively, include receiving, from the third device, second input data representing a second user utterance. The process 1100 may also include determining, from the second input data, intent data indicating a request to alter output of the audio content. Based at least in part on receiving the second input data from the third device and the intent data, the process 1100 may include causing output of the audio content to be altered.

The process 1100 may additionally, or alternatively, include determining that the user utterance includes an anaphora and determining that the anaphora corresponds to the audio content based on the audio content being output by the first device at the time the input data was received. In this example, the anaphora may be the word “this,” and based at least in part on the first device outputting the audio content, it may be determined that “this” corresponds to the audio content. Determining that the audio content is to be output by the first device instead of the second device may be based at least in part on determining that the anaphora refers to the audio content.

The process 1100 may additionally, or alternatively, include determining that the user utterance includes an anaphora and determining that the anaphora corresponds to an identification of the second device based at least in part on the input data being received via the second device. In this example, the anaphora may be the word “here,” and based at least in part on receiving the input data from the second device, it may be determined that “here” corresponds to the second device. Determining that the audio content is to be output by the first device instead of the second device may be based at least in part on determining that the anaphora refers to the second device.
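As a non-limiting illustration, the two anaphora examples above may be sketched in Python as follows. The function name and its parameters are hypothetical; the sketch assumes only that the system knows which device received the input data and which audio content was being output at that time.

```python
def resolve_anaphora(utterance, receiving_device, playing_content):
    """Map the anaphoric words "here" and "this" to concrete referents.

    receiving_device: identifier of the device that captured the utterance.
    playing_content:  content being output when the input data was received.
    """
    words = utterance.lower().replace(",", " ").split()
    referents = {}
    if "here" in words:
        # "here" is taken to refer to the device that received the input data.
        referents["here"] = receiving_device
    if "this" in words:
        # "this" is taken to refer to the audio content currently being output.
        referents["this"] = playing_content
    return referents
```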

The process 1100 may additionally, or alternatively, include determining that an amount of time has passed since the queue data was associated with the second device and determining that the amount of time is more than a threshold amount of time. The process 1100 may also include causing the second device to be dissociated from the first device based at least in part on determining that the amount of time is more than the threshold amount of time. Dissociating devices may also be based at least in part on a determination that the association of the devices occurred on a previous day. The states of the devices may also be dissociated and the audio-session queue may be dissociated from one or all of the previously-associated devices.
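A minimal Python sketch of the two dissociation conditions described above is given below for illustration. The function name and the two-hour default threshold are assumptions; the description does not specify a threshold value.

```python
from datetime import datetime, timedelta


def should_dissociate(associated_at, now, threshold=timedelta(hours=2)):
    """Return True if previously associated devices should be dissociated.

    The two-hour default threshold is an illustrative assumption.
    """
    # Condition 1: more than the threshold amount of time has passed.
    too_old = (now - associated_at) > threshold
    # Condition 2: the association occurred on a previous day.
    previous_day = associated_at.date() < now.date()
    return too_old or previous_day
```

In this sketch either condition alone is sufficient, so an association made shortly before midnight is dissociated after the date changes even if the threshold has not yet elapsed.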

The process 1100 may also include receiving, via the first device, second input data representing a second user utterance and determining intent data indicating a request to output second audio content. The process 1100 may also include determining that the second audio content is to be output via the second device without altering output of the first audio content via the first device. The process 1100 may also include causing the second device to output the second audio content concurrently with the first device outputting the first audio content.

The process 1100 may also include receiving, via the first device, second input data representing a second user utterance and determining, based at least in part on the second input data, intent data indicating a request to output second audio content. The process 1100 may also include determining that the first device is outputting the first audio content and causing the first device to output audio representing a request to authorize the second audio content to be output via the first device. Third input data representing a response to the request may be received via the first device and the process 1100 may include causing the first device to output the second audio content based at least in part on the third input data indicating authorization.

FIG. 12 illustrates a conceptual diagram of how a spoken utterance can be processed, allowing a system to capture and execute commands spoken by a user, such as spoken commands that may follow a wakeword, or trigger expression, (i.e., a predefined word or phrase for “waking” a device, causing the device to begin sending audio data to a remote system, such as system 120). The various components illustrated may be located on a same or different physical devices. Communication between various components illustrated in FIG. 12 may occur directly or across a network 118. An audio capture component, such as a microphone 112 of the device 102, or another device, captures audio 1200 corresponding to a spoken utterance. The device 102 or 104, using a wakeword detection module 1201, then processes audio data corresponding to the audio 1200 to determine if a keyword (such as a wakeword) is detected in the audio data. Following detection of a wakeword, the device 102 or 104 sends audio data 1202 corresponding to the utterance to the remote system 120 that includes an ASR module 136. The audio data 1202 may be output from an optional acoustic front end (AFE) 1256 located on the device prior to transmission. In other instances, the audio data 1202 may be in a different form for processing by a remote AFE 1256, such as the AFE 1256 located with the ASR module 136 of the remote system 120.

The wakeword detection module 1201 works in conjunction with other components of the user device, for example a microphone, to detect keywords in audio 1200. For example, the device may convert audio 1200 into audio data, and process the audio data with the wakeword detection module 1201 to determine whether human sound is detected, and if so, if the audio data comprising human sound matches an audio signature and/or model corresponding to a particular keyword.

The user device may use various techniques to determine whether audio data includes human sound. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether human sound is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy levels of the audio input in one or more spectral bands; the signal-to-noise ratios of the audio input in one or more spectral bands; or other quantitative aspects. In other embodiments, the user device may implement a limited classifier configured to distinguish human sound from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models in human sound storage, which acoustic models may include models corresponding to human sound, noise (such as environmental noise or background noise), or silence. Still other techniques may be used to determine whether human sound is present in the audio input.
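For illustration, one of the simplest quantitative aspects named above, the energy level of the audio input, may be used for frame-level voice activity detection as in the following Python sketch. The frame length, the decibel threshold, and the function names are assumptions, and the sketch uses full-band energy rather than per-band energy for brevity.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame, in dB relative to full scale."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)  # small offset avoids log(0)


def detect_voice(samples, frame_len=160, threshold_db=-30.0):
    """Return one boolean per complete frame: True if energy exceeds the threshold.

    frame_len=160 corresponds to 10 ms at an assumed 16 kHz sample rate.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags
```

An energy threshold alone cannot distinguish speech from other loud sounds, which is one reason the classifier-based and model-based techniques described above may be used instead or in addition.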

Once human sound is detected in the audio received by the user device (or separately from human sound detection), the user device may use the wakeword detection module 1201 to perform wakeword detection to determine when a user intends to speak a command to the user device. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection may be performed without performing linguistic analysis, textual analysis or semantic analysis. Instead, incoming audio (or audio data) is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio “matches” stored audio data corresponding to a keyword.

Thus, the wakeword detection module 1201 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds hidden Markov models (HMMs) for each wakeword and for non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wakeword spotting system may be built on deep neural network (DNN)/recursive neural network (RNN) structures directly, without an HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for the DNN, or using the RNN. Follow-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
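The posterior smoothing and thresholding step mentioned above may be sketched as follows, for illustration only. The sketch assumes that a network has already produced one wakeword posterior per frame; the moving-average window of three frames and the threshold of 0.8 are hypothetical values.

```python
def smooth(posteriors, window=3):
    """Causal moving average over the most recent `window` frame posteriors."""
    smoothed = []
    for i in range(len(posteriors)):
        lo = max(0, i - window + 1)
        chunk = posteriors[lo:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def wakeword_detected(posteriors, threshold=0.8, window=3):
    """Declare a detection if any smoothed posterior reaches the threshold."""
    return any(p >= threshold for p in smooth(posteriors, window))
```

Smoothing in this manner keeps a single spuriously high frame from triggering a detection, while a sustained run of high posteriors still crosses the threshold.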

Once the wakeword is detected, the local device 102 may “wake” and begin transmitting audio data 1202 corresponding to input audio 1200 to the remote system 120 for speech processing. Audio data corresponding to that audio may be sent to remote system 120 for routing to a recipient device or may be sent to the remote system 116 for speech processing for interpretation of the included speech (either for purposes of enabling voice-communications and/or for purposes of executing a command in the speech). The audio data 1202 may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the local device 102 prior to sending. Further, a local device may “wake” upon detection of speech/spoken audio above a threshold, as described herein. Upon receipt by the remote system 120, an ASR module 136 may convert the audio data 1202 into text. The ASR transcribes audio data into text data representing the words of the speech contained in the audio data 1202. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data is input to a processor configured to perform ASR which then interprets the utterance based on the similarity between the utterance and pre-established language models 1254 stored in an ASR model knowledge base (ASR Models Storage 1252). For example, the ASR process may compare the input audio data with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.

The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or a confidence score representing the likelihood that a particular set of words matches those spoken in the utterance. The confidence score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model 1253 stored in an ASR Models Storage 1252), and the likelihood that a particular word that matches the sounds would be included in the sentence at the specific location (e.g., using a language or grammar model). Thus, each potential textual interpretation of the spoken utterance (hypothesis) is associated with a confidence score. Based on the considered factors and the assigned confidence score, the ASR process 136 outputs the most likely text recognized in the audio data. The ASR process may also output multiple hypotheses in the form of a lattice or an N-best list with each hypothesis corresponding to a confidence score or other score (such as probability scores, etc.).
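As an illustrative sketch, an N-best list may be formed by combining an acoustic score and a language-model score for each hypothesis and sorting by the combined score. The dictionary keys, the log-probability values, and the `lm_weight` parameter below are hypothetical and are not drawn from any particular ASR implementation.

```python
def rank_hypotheses(hypotheses, lm_weight=1.0):
    """Return (text, combined_score) pairs sorted best-first.

    Each hypothesis is a dict with assumed keys "text", "acoustic_logp",
    and "lm_logp" (log-probabilities, so higher is more likely).
    """
    scored = [
        (h["text"], h["acoustic_logp"] + lm_weight * h["lm_logp"])
        for h in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a case such as the test data, a hypothesis that sounds slightly closer to the audio can still rank lower because the language model finds its word sequence less likely.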

The device or devices performing the ASR processing may include an acoustic front end (AFE) 1256 and a speech recognition engine 1258. The acoustic front end (AFE) 1256 transforms the audio data from the microphone into data for processing by the speech recognition engine 1258. The speech recognition engine 1258 compares the speech recognition data with acoustic models 1253, language models 1254, and other data models and information for recognizing the speech conveyed in the audio data. The AFE 1256 may reduce noise in the audio data and divide the digitized audio data into frames representing time intervals for which the AFE 1256 determines a number of values, called features, representing the qualities of the audio data, along with a set of those values, called a feature vector, representing the features/qualities of the audio data within the frame. Many different features may be determined, as known in the art, and each feature represents some quality of the audio that may be useful for ASR processing. A number of approaches may be used by the AFE to process the audio data, such as mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those of skill in the art.

The speech recognition engine 1258 may process the output from the AFE 1256 with reference to information stored in speech/model storage (1252). Alternatively, post front-end processed data (such as feature vectors) may be received by the device executing ASR processing from another source besides the internal AFE. For example, the user device may process audio data into feature vectors (for example using an on-device AFE 1256) and transmit that information to a server across a network for ASR processing. Feature vectors may arrive at the remote system 120 encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 1258.

The speech recognition engine 1258 attempts to match received feature vectors to language phonemes and words as known in the stored acoustic models 1253 and language models 1254. The speech recognition engine 1258 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme. The language information is used to adjust the acoustic score by considering what sounds and/or words are used in context with each other, thereby improving the likelihood that the ASR process will output speech results that make sense grammatically. The specific models used may be general models or may be models corresponding to a particular domain, such as music, banking, etc. By way of example, a user utterance may be “Alexa, add the music to the kitchen,” or “Alexa, move the music to the kitchen,” or “Alexa, stop the music in the kitchen.” The wake detection module may identify the wake word, otherwise described as a trigger expression, “Alexa” in the user utterance and may “wake” based on identifying the wake word. Audio data corresponding to the user utterance may be sent to the remote system 120 where the speech recognition engine 1258 may identify, determine, and/or generate text data corresponding to the user utterance, here “add the music to the kitchen,” “move the music to the kitchen,” or “stop the music in the kitchen.”

The speech recognition engine 1258 may use a number of techniques to match feature vectors to phonemes, for example using Hidden Markov Models (HMMs) to determine probabilities that feature vectors may match phonemes. Sounds received may be represented as paths between states of the HMM and multiple paths may represent multiple possible text matches for the same sound.
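For illustration, the most probable path between states of an HMM may be found with Viterbi decoding, as in the following Python sketch. The state names, observation symbols, and probability tables used with it are toy values assumed for the example; a practical system would operate on feature vectors and typically on log-probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the best (probability, path) ending in state s so far.
    best = {
        s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prob, path = max(
                (
                    (best[p][0] * trans_p[p][s] * emit_p[s][obs], best[p][1])
                    for p in states
                ),
                key=lambda item: item[0],
            )
            nxt[s] = (prob, path + [s])
        best = nxt
    return max(best.values(), key=lambda item: item[0])
```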

Following ASR processing, the ASR results may be sent by the speech recognition engine 1258 to other processing components, which may be local to the device performing ASR and/or distributed across the network(s). For example, ASR results in the form of a single textual representation of the speech, an N-best list including multiple hypotheses and respective scores, lattice, etc. may be sent to the remote system 120, for natural language understanding (NLU) processing, such as conversion of the text into commands for execution, either by the user device, by the remote system 120, or by another device (such as a server running a specific application like a search engine, etc.).

The device performing NLU processing 138 (e.g., server 120) may include various components, including potentially dedicated processor(s), memory, storage, etc. As shown in FIG. 12, an NLU component 138 may include a recognizer 1263 that includes a named entity recognition (NER) module 1262 which is used to identify portions of query text that correspond to a named entity that may be recognizable by the system. A downstream process called named entity resolution links a text portion to a specific entity known to the system. To perform named entity resolution, the system may utilize gazetteer information (1284 a-1284 n) stored in entity library storage 1282. The gazetteer information may be used for entity resolution, for example matching ASR results with different entities (such as song titles, contact names, etc.). Gazetteers may be linked to users (for example a particular gazetteer may be associated with a specific user's music collection), may be linked to certain domains (such as shopping), or may be organized in a variety of other ways.

Generally, the NLU process takes textual input (such as processed from ASR 136 based on the utterance input audio 1200) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 138 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 102) to complete that action. For example, if a spoken utterance is processed using ASR 136 and outputs the text “add music to the kitchen,” the NLU process may determine that the user intended for the audio being output by a device to also be output by another device associated with the identifier of kitchen.

The NLU may process several textual inputs related to the same utterance. For example, if the ASR 136 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain NLU results.

As will be discussed further below, the NLU process may be configured to parse and tag to annotate text as part of NLU processing. For example, for the text “move the music to the kitchen,” “move” may be tagged as a command (to output audio on a device) and “kitchen” may be tagged as a specific device to output the audio on instead of the previous device.

To correctly perform NLU processing of speech input, an NLU process 138 may be configured to determine a “domain” of the utterance so as to determine and narrow down which services offered by the endpoint device (e.g., remote system 120 or the user device) may be relevant. For example, an endpoint device may offer services relating to interactions with a telephone service, a contact list service, a calendar/scheduling service, a music player service, etc. Words in a single text query may implicate more than one service, and some services may be functionally linked (e.g., both a telephone service and a calendar service may utilize data from the contact list).

The named entity recognition (NER) module 1262 receives a query in the form of ASR results and attempts to identify relevant grammars and lexical information that may be used to construe meaning. To do so, the NLU module 138 may begin by identifying potential domains that may relate to the received query. The NLU storage 1273 includes a database of devices (1274 a-1274 n) identifying domains associated with specific devices. For example, the user device may be associated with domains for music, telephony, calendaring, contact lists, and device-specific communications, but not video. In addition, the entity library may include database entries about specific services on a specific device, either indexed by Device ID, User ID, or Household ID, or some other indicator.

In NLU processing, a domain may represent a discrete set of activities having a common theme, such as “shopping,” “music,” “calendaring,” etc. As such, each domain may be associated with a particular recognizer 1263, language model and/or grammar database (1276 a-1276 n), a particular set of intents/actions (1278 a-1278 n), and a particular personalized lexicon (1286). Each gazetteer (1284 a-1284 n) may include domain-indexed lexical information associated with a particular user and/or device. For example, the Gazetteer A (1284 a) includes domain-indexed lexical information 1286 aa to 1286 an. A user's contact-list lexical information might include the names of contacts. Since every user's contact list is presumably different, this personalized information improves entity resolution.

As noted above, in traditional NLU processing, a query may be processed applying the rules, models, and information applicable to each identified domain. For example, if a query potentially implicates both communications and, for example, music, the query may, substantially in parallel, be NLU processed using the grammar models and lexical information for communications, and will be processed using the grammar models and lexical information for music. The responses based on the query produced by each set of models are scored, with the overall highest ranked result from all applied domains ordinarily selected to be the correct result.

An intent classification (IC) module 1264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database (1278 a-1278 n) of words linked to intents. For example, a music intent database may link words and phrases such as “add,” “move,” “remove,” “quiet,” “volume off,” and “mute” to a “mute” intent. A voice-message intent database, meanwhile, may link words and phrases such as “send a message,” “send a voice message,” “send the following,” or the like. The IC module 1264 identifies potential intents for each identified domain by comparing words in the query to the words and phrases in the intents database 1278. In some instances, the determination of an intent by the IC module 1264 is performed using a set of rules or templates that are processed against the incoming text to identify a matching intent.
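A rule-based comparison of query words against an intents database may be sketched in Python as follows, for illustration only. The intent names and the word-to-intent mapping are hypothetical and are chosen to mirror the add, move, and remove requests discussed in this description; they are not the contents of the intents database 1278.

```python
# Hypothetical intents database: intent name -> trigger words.
INTENT_WORDS = {
    "AddDeviceToSession": ["add"],
    "MoveSession": ["move"],
    "RemoveDeviceFromSession": ["remove", "stop"],
}


def classify_intent(text):
    """Return the first intent whose trigger word appears in the query."""
    words = text.lower().split()
    for intent, triggers in INTENT_WORDS.items():
        if any(trigger in words for trigger in triggers):
            return intent
    return None  # no rule matched
```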

In order to generate a particular interpreted response, the NER 1262 applies the grammar models and lexical information associated with the respective domain to actually recognize a mention of one or more entities in the text of the query. In this manner, the NER 1262 identifies “slots” or values (i.e., particular words in query text) that may be needed for later command processing. Depending on the complexity of the NER 1262, it may also label each slot with a type of varying levels of specificity (such as noun, place, city, artist name, song name, device identification, audio identification, audio-session queue identification, or the like). Each grammar model 1276 includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information 1286 from the gazetteer 1284 is personalized to the user(s) and/or the device. For instance, a grammar model associated with the shopping domain may include a database of words commonly used when people discuss shopping.

The intents identified by the IC module 1264 are linked to domain-specific grammar frameworks (included in 1276) with “slots” or “fields” to be filled with values. Each slot/field corresponds to a portion of the query text that the system believes corresponds to an entity. To make resolution more flexible, these frameworks would ordinarily not be structured as sentences, but rather based on associating slots with grammatical tags. For example, if “add the music to the kitchen” is an identified intent, a grammar (1276) framework or frameworks may correspond to sentence structures such as “add {audio-session queue} to {kitchen}.”
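One way to picture a framework with slots is as a pattern with named fields, as in the following illustrative Python sketch. A regular expression is used here purely as an assumed stand-in for a grammar framework; the slot names `verb`, `audio`, and `device` are hypothetical.

```python
import re

# Hypothetical framework: "<verb> {audio} to [the] {device}".
FRAMEWORK = re.compile(
    r"^(?P<verb>add|move)\s+(?P<audio>.+?)\s+to\s+(?:the\s+)?(?P<device>.+)$",
    re.IGNORECASE,
)


def fill_slots(text):
    """Return a dict of slot values, or None if the text does not fit."""
    match = FRAMEWORK.match(text.strip())
    return match.groupdict() if match else None
```

The slot values produced this way are still plain text, so a value such as "kitchen" would then be resolved to a specific device by a later entity-resolution step.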

For example, the NER module 1262 may parse the query to identify words as subject, object, verb, preposition, etc., based on grammar rules and/or models, prior to recognizing named entities. The identified verb may be used by the IC module 1264 to identify intent, which is then used by the NER module 1262 to identify frameworks. A framework for the intent of “play a song,” meanwhile, may specify a list of slots/fields applicable to play the identified “song” and any object modifier (e.g., specifying a music collection from which the song should be accessed) or the like. The NER module 1262 then searches the corresponding fields in the domain-specific and personalized lexicon(s), attempting to match words and phrases in the query tagged as a grammatical object or object modifier with those identified in the database(s).

This process includes semantic tagging, which is the labeling of a word or combination of words according to their type/semantic meaning. Parsing may be performed using heuristic grammar rules, or an NER model may be constructed using techniques such as hidden Markov models, maximum entropy models, log linear models, conditional random fields (CRF), and the like.

The frameworks linked to the intent are then used to determine what database fields should be searched to determine the meaning of these phrases, such as searching a user's gazetteer for similarity with the framework slots. If the search of the gazetteer does not resolve the slot/field using gazetteer information, the NER module 1262 may search the database of generic words associated with the domain (in the knowledge base 1272). So, for instance, if the query was “add the music to the kitchen,” after failing to determine which device corresponds to the identity of “kitchen,” the NER component 1262 may search the domain vocabulary for device identifiers associated with the word “kitchen.” In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.
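The gazetteer-first lookup with a fallback to the generic domain vocabulary may be sketched as follows, for illustration. The dictionaries, the device identifiers in the test data, and the function name are hypothetical.

```python
def resolve_device(slot_value, gazetteer, domain_vocabulary):
    """Resolve a slot value to a device identifier.

    Returns (identifier, source), where source records which lookup
    succeeded: "gazetteer", "domain", or "unresolved".
    """
    key = slot_value.lower()
    # First try the personalized gazetteer information.
    if key in gazetteer:
        return gazetteer[key], "gazetteer"
    # Fall back to generic words associated with the domain.
    if key in domain_vocabulary:
        return domain_vocabulary[key], "domain"
    return None, "unresolved"
```

Reversing the two lookups, or running both and comparing the results, corresponds to the alternatives noted at the end of the preceding paragraph.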

The output data from the NLU processing (which may include tagged text, commands, etc.) may then be sent to a command processor 1207. The destination command processor 1207 may be determined based on the NLU output. For example, if the NLU output includes a command to send a message, the destination command processor 1207 may be a message sending application, such as one located on the user device or in a message sending appliance, configured to execute a message sending command. If the NLU output includes a search request, the destination command processor 1207 may include a search engine processor, such as one located on a search server, configured to execute a search command. After the appropriate command is generated based on the intent of the user, the command processor 1207 may provide some or all of this information to a text-to-speech (TTS) engine. The TTS engine may then generate an actual audio file for outputting the audio data determined by the command processor 1207 (e.g., “playing in the kitchen,” or “music moved to the kitchen”). After generating the file (or “audio data”), the TTS engine may provide this data back to the remote system 120.

The NLU operations of existing systems may take the form of a multi-domain architecture. Each domain (which may include a set of intents and entity slots that define a larger concept such as music, books etc. as well as components such as trained models, etc. used to perform various NLU operations such as NER, IC, or the like) may be constructed separately and made available to an NLU component 138 during runtime operations where NLU operations are performed on text (such as text output from an ASR component 136). Each domain may have specially configured components to perform various steps of the NLU operations.

For example, in an NLU system, the system may include a multi-domain architecture consisting of multiple domains for intents/commands executable by the system (or by other devices connected to the system), such as music, video, books, and information. The system may include a plurality of domain recognizers, where each domain may include its own recognizer 1263. Each recognizer may include various NLU components such as an NER component 1262, IC module 1264 and other components such as an entity resolver, or other components.

For example, a messaging domain recognizer 1263-A (Domain A) may have an NER component 1262-A that identifies what slots (i.e., portions of input text) may correspond to particular words relevant to that domain. The words may correspond to entities such as (for the messaging domain) a recipient. An NER component 1262 may use a machine learning model, such as a domain specific conditional random field (CRF) to both identify the portions corresponding to an entity as well as identify what type of entity corresponds to the text portion. The messaging domain recognizer 1263-A may also have its own intent classification (IC) component 1264-A that determines the intent of the text assuming that the text is within the prescribed domain. An IC component may use a model, such as a domain specific maximum entropy classifier to identify the intent of the text, where the intent is the action the user desires the system to perform. For this purpose, the remote system computing device 116 may include a model training component. The model training component may be used to train the classifier(s)/machine learning models discussed above.

As noted above, multiple devices may be employed in a single speech processing system. In such a multi-device system, each of the devices may include different components for performing different aspects of the speech processing. The multiple devices may include overlapping components. The components of the user device and the remote system 120, as illustrated herein are exemplary, and may be located in a stand-alone device or may be included, in whole or in part, as a component of a larger device or system, may be distributed across a network or multiple devices connected by a network, etc.

FIG. 13 illustrates a conceptual diagram of components of a speech processing system 120 associating audio output commands with multiple devices, including a command processor 1207 configured to generate a command that the selected voice-enabled device uses to respond to a user utterance. As used with respect to FIG. 13, a voice-enabled device may include a communal device, such as the communal device 102 from FIG. 1. As illustrated in FIG. 13, the speech processing system 120, including the orchestration component 1324 and a speech processing component 132 comprising the ASR component 136 and the NLU component 138, may be coupled to the targeting component 1334 and provide the targeting component 1334 with the intent determined to be expressed in the user utterance. Further, the arbitration component 1330 may provide the ranked list of devices to the targeting component 1334, as well as device indicators (e.g., IP addresses, device names, etc.) for one or more of the voice-enabled devices in the ranked list of devices. The targeting component 1334 may then perform techniques to determine a target device (e.g., a device to perform the requested operation), and provide various data to the command processor 1207. For instance, the targeting component 1334 may provide the command processor 1207 with various device identifiers of the voice-enabled devices, the determined target device, the determined intent and/or command, etc. By way of example, the targeting component 1334 may determine which devices to add to a grouping of devices, which devices to remove from a grouping of devices, and/or which devices to move an audio-session to. The association and dissociation of device states and/or audio-session queues using the targeting component 1334 is described in more detail with respect to FIG. 1, above.

The command processor 1207 and/or NLU component 138 may determine a domain based on the intent and, based on this determination, route the request corresponding to the audio data to the appropriate domain speechlet, such as the illustrated domain speechlets 1342. The domain speechlet 1342 may comprise any type of device or group of devices (e.g., hardware device, virtual devices or partitions, server, etc.), and may receive the text data and/or an intent associated with the audio signals and may determine how to respond to the request. For instance, the intent for a command “add the music to the kitchen” may be routed to a music domain speechlet 1342, which controls devices, such as speakers, connected to the voice-enabled devices. The music domain speechlet 1342 may determine a command to generate based on the intent of the user to output audio on a device associated with the kitchen identifier as well as continuing to output the audio on another device that is currently outputting the audio. Additionally, the music domain speechlet 1342 may determine additional content, such as audio data, to be output by one of the voice-enabled devices, such as “kitchen has been added to your audio session.”

Various types of domain speechlets 1342 may be used to determine which devices to send commands to and/or to use in response to a user utterance, as well as the appropriate response and potential additional content (e.g., audio data). For example, the domain speechlets 1342 may include a third party skills domain speechlet 1342, which may handle intents associated with gaming, productivity, etc., a music domain speechlet 1342, which may handle intents associated with music play requests (e.g., Amazon Music, Pandora, Spotify, iHeart, etc.), and/or an information domain speechlet 1342, which may handle requests for information associated, for example, with the status of a particular device and/or content being utilized and/or output by a particular device and/or group of devices.
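As a non-limiting sketch, routing an intent to a domain speechlet may resemble the following. The intent names, the INTENT_DOMAINS table, and the shape of the returned dictionary are assumptions made for illustration; the disclosure does not specify these structures.

```python
def music_speechlet(intent, slots):
    """Hypothetical music domain: builds directive data plus text for TTS output."""
    room = slots["room"]
    return {
        "domain": "music",
        "directive": {"action": intent, "target": room},
        "speech": f"{room} has been added to your audio session",
    }


def information_speechlet(intent, slots):
    """Hypothetical information domain: status requests, no extra speech here."""
    return {"domain": "information", "directive": {"action": intent}, "speech": None}


def third_party_speechlet(intent, slots):
    """Hypothetical fallback domain for third-party skills."""
    return {"domain": "third_party", "directive": {"action": intent}, "speech": None}


SPEECHLETS = {
    "music": music_speechlet,
    "information": information_speechlet,
    "third_party": third_party_speechlet,
}

# Assumed mapping from intent to domain; unknown intents fall back to third party.
INTENT_DOMAINS = {"AddDeviceToSession": "music", "GetDeviceStatus": "information"}


def route(intent, slots):
    """Pick a domain for the intent and let that speechlet build the response."""
    domain = INTENT_DOMAINS.get(intent, "third_party")
    return SPEECHLETS[domain](intent, slots)
```

For example, route("AddDeviceToSession", {"room": "kitchen"}) would select the hypothetical music speechlet and return directive data targeting the kitchen.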

After the domain speechlet 1342 generates the appropriate command, which may be described herein as directive data, based on the intent of the user, and/or provides additional content, such as audio data, to be output by one of the voice-enabled devices, the domain speechlet 1342 may provide this information back to the speech system 120, which in turn provides some or all of this information to a text-to-speech (TTS) engine 142. The TTS engine 142 then generates an actual audio file for outputting the second audio data determined by the domain speechlet 1342. After generating the file (or “audio data”), the TTS engine 142 may provide this data back to the speech system 120.

The speech system 120 may then publish (i.e., write) some or all of this information to an event bus 1346. That is, the speech system 120 may provide information regarding the initial request (e.g., the speech, the text, the domain/intent, etc.), the response to be provided to the voice-enabled device, or any other information pertinent to the interaction between the voice-enabled device and the speech processing system 120 to the event bus 1346.

Within the speech processing system 120, one or more components or services may subscribe to the event bus 1346 so as to receive information regarding interactions between user devices and the speech processing system 120. In the illustrated example, for instance, the device management component 1348 may subscribe to the event bus 1346 and, thus, may monitor information regarding these interactions. In some examples, monitoring information in the event bus 1346 may comprise communications between various components of the speech processing system 120. For example, the targeting component 1334 may monitor the event bus 1346 to identify device state data for voice-enabled devices. In some examples, the event bus 1346 may “push” or send indications of events and/or device state data to the targeting component 1334. Additionally, or alternatively, the event bus 1346 may be “pulled” where the targeting component 1334 sends requests to the event bus 1346 to provide an indication of device state data for a voice-enabled device. The event bus 1346 may store indications of the device states for the devices, such as in a database (e.g., user registry 1336), and using the stored indications of the device states, send the device state data for voice-enabled devices to the targeting component 1334. Thus, to identify device state data for a device, the targeting component 1334 may send a request to the event bus 1346 (e.g., event component) to provide an indication of the device state data associated with a device, and receive, from the event bus 1346, the device state data that was requested.
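The “push” and “pull” interactions described above may be sketched, under simplifying assumptions, as a small in-memory publish/subscribe object. The EventBus class and its method names are hypothetical, and an in-process dictionary stands in for the database (e.g., user registry 1336) in which device states may be stored.

```python
class EventBus:
    """Hypothetical in-memory event bus supporting push and pull of device states."""

    def __init__(self):
        self._subscribers = []    # callbacks that receive pushed events
        self._device_states = {}  # stand-in for the stored device states

    def subscribe(self, callback):
        """Register a component (e.g., a targeting component) for pushed events."""
        self._subscribers.append(callback)

    def publish(self, device_id, state):
        """Store the latest state, then push it to every subscriber."""
        self._device_states[device_id] = state
        for callback in self._subscribers:
            callback(device_id, state)

    def get_state(self, device_id):
        """Pull: return the stored state for a device, or None if unknown."""
        return self._device_states.get(device_id)
```

A subscriber registered with subscribe receives each published state as it occurs, while get_state allows a component to request the stored state on demand.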

The device management component 1348 functions to monitor information published to the event bus 1346 and identify events that may trigger action. For instance, the device management component 1348 may identify (e.g., via filtering) those events that: (i) come from voice-enabled devices that are associated with secondary device(s) (e.g., have secondary devices in their environments such as televisions, personal computing devices, etc.), and (ii) are associated with supplemental content (e.g., image data, video data, etc.). The device management component 1348 may reference the user registry 1336 to determine which voice-enabled devices are associated with secondary devices, as well as determine device types, states, and other capabilities of these secondary devices. For instance, the device management component 1348 may determine, from the information published to the event bus 1346, an identifier associated with the voice-enabled device making the corresponding request or the voice-enabled device selected to respond to or act upon the user utterance. The device management component 1348 may use this identifier to identify, from the user registry 1336, a user account associated with the voice-enabled device. The device management component 1348 may also determine whether any secondary devices have been registered with the identified user account, as well as capabilities of any such secondary devices, such as how the secondary devices are configured to communicate (e.g., via WiFi, short-range wireless connections, etc.), the type of content the devices are able to output (e.g., audio, video, still images, flashing lights, etc.), and the like. As used herein, the secondary device may include one or more of the communal devices 102 from FIG. 1. For example, the secondary devices may include speakers that may wirelessly communicate with the voice-enabled device and/or one or more other secondary devices, such as personal devices.

The device management component 1348 may determine whether a particular event identified is associated with supplemental content. That is, the device management component 1348 may write, to a datastore, indications of which types of events and/or which primary content or responses are associated with supplemental content. In some instances, the speech processing system 120 may provide access to third-party developers to allow the developers to register supplemental content for output on secondary devices for particular events and/or primary content. For example, if a voice-enabled device is to output that the weather will include thunder and lightning, the device management component 1348 may store an indication of supplemental content such as thunder sounds, pictures/animations of lightning and the like. In another example, if a voice-enabled device is outputting information about a particular fact (e.g., “a blue whale is the largest mammal on earth . . . ”), then a secondary device, such as a television, may be configured to provide supplemental content such as a video or picture of a blue whale. In another example, if a voice-enabled device is outputting audio, then a secondary device, such as a speaker, may be configured to also output the audio based at least in part on a user utterance representing a request to add the secondary device to the audio session. In these and other examples, the device management component 1348 may store an association between the primary response or content (e.g., outputting of information regarding the world's largest mammal) and corresponding supplemental content (e.g., the audio data, image data, or the like). In some instances, the device management component 1348 may also indicate which types of secondary devices are to output which supplemental content. For instance, in the instant example, the device management component 1348 may store an indication that secondary devices of a class type “tablet” are to output a picture of a blue whale. In these and other instances, meanwhile, the device management component 1348 may store the supplemental content in association with secondary-device capabilities (e.g., devices with speakers output the audio commentary, devices with screens output the image, etc.).
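One way to picture storing supplemental content in association with secondary-device capabilities is a lookup keyed by event and capability. The table below, including the event key and file names, is purely hypothetical and is offered only as a minimal sketch.

```python
# Hypothetical datastore: event -> capability -> supplemental content.
SUPPLEMENTAL = {
    "blue_whale_fact": {
        "screen": "blue_whale.jpg",
        "speaker": "whale_commentary.mp3",
    },
}


def content_for(event, capabilities):
    """Return the supplemental content a device can output for an event."""
    by_capability = SUPPLEMENTAL.get(event, {})
    return [
        content
        for capability, content in by_capability.items()
        if capability in capabilities
    ]
```

Under these assumptions, a device reporting only a screen would receive the image, while a device reporting both a screen and a speaker would receive both items.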

The device management component 1348 may also determine how to transmit response and/or supplemental content (and/or information for acquiring the content) to the voice-enabled devices and/or the secondary devices. To make this determination, the device management component 1348 may determine a device type of the voice-enabled devices and/or secondary devices, capabilities of the device(s), or the like, potentially as stored in the user registry 1336. In some instances, the device management component 1348 may determine that a particular device is able to communicate directly with the speech processing system 120 (e.g., over WiFi) and, thus, the device management component 1348 may provide the response and/or content directly over a network 118 to the secondary device (potentially via the speech system 120). In another example, the device management component 1348 may determine that a particular secondary device is unable to communicate directly with the speech processing system 120, but instead is configured to communicate with a voice-enabled device in its environment over short-range wireless networks. As such, the device management component 1348 may provide the supplemental content (or information) to the speech system 120, which in turn may send this to the voice-enabled device, which may send the information over a short-range network to the secondary device.
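The two delivery paths described above (direct delivery versus relay through a voice-enabled device) may be sketched as follows. The device dictionaries, the transport labels "wifi" and "short_range", and the function name delivery_path are assumptions for illustration only.

```python
def delivery_path(device, paired_voice_device=None):
    """Return the ordered hops used to deliver content to a secondary device."""
    transports = device["transports"]
    if "wifi" in transports:
        # The device can communicate with the speech system directly.
        return ["speech_system", device["id"]]
    if "short_range" in transports and paired_voice_device is not None:
        # Relay through a voice-enabled device in the same environment.
        return ["speech_system", paired_voice_device, device["id"]]
    raise ValueError(f"no route to device {device['id']}")
```

In this sketch, a short-range-only speaker with no paired voice-enabled device raises an error, since the disclosure describes no other delivery path for such a device.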

In addition to the above, the device management component 1348 may include the media-grouping state controller 140. The media-grouping state controller 140 may be configured to perform the same or similar operations as the media-grouping state controller 140 described with respect to FIG. 1.

The computer-readable media 132 may further include the user registry 1336 that includes data regarding user profiles as described herein. The user registry 1336 may be located as part of, or proximate to, the speech processing system 120, or may otherwise be in communication with various components, for example over the network 118. The user registry 1336 may include a variety of information related to individual users, accounts, etc. that interact with the voice-enabled devices, and the speech processing system 120. For illustration, the user registry 1336 may include data regarding the devices associated with particular individual user profiles. Such data may include user or device identifier (ID) and internet protocol (IP) address information for different devices as well as names by which the devices may be referred to by a user. Further qualifiers describing the devices may also be listed along with a description of the type of object of the device. Further, the user registry 1336 may store indications of associations between various voice-enabled devices and/or secondary devices, such as virtual clusters of devices, states of devices, and associations between devices and audio-session queues. The user registry 1336 may represent clusters of devices as single devices that can receive commands and disperse the commands to each device in the cluster. In some examples, the virtual cluster of devices may be represented as a single device which is determined as being capable, or not capable (e.g., offline), of performing a command in a user utterance. A virtual cluster of devices may generally correspond to a stored grouping of devices, or a stored association between a group of devices.
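A virtual cluster that is addressed as a single device may be sketched as follows. The VirtualCluster class is hypothetical, and the rule that a cluster is capable only when every member is online is an assumption; the disclosure does not state how capability of a cluster is determined.

```python
class VirtualCluster:
    """Hypothetical stored grouping of devices, addressed as a single device."""

    def __init__(self, name, members):
        self.name = name
        self.members = list(members)

    def is_capable(self, online_devices):
        # Assumption: the cluster is capable only if every member is online.
        return all(member in online_devices for member in self.members)

    def disperse(self, command):
        """Fan a single command out to each device in the cluster."""
        return {member: command for member in self.members}
```

A caller would first check is_capable against the set of devices currently online and, if the cluster is capable, call disperse to obtain the per-device commands.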

In some examples, the device state for devices associated with a user account may indicate a current state of the device. In this way, the command processor 1207 and/or the domain speechlets 1342 may determine, based on the stored device states in the user registry 1336, a current device state of the voice-enabled devices. Rather than receiving device states for the voice-enabled devices in metadata, the device states may already have been determined or received and stored in the user registry 1336. Further, the user registry 1336 may provide indications of various permission levels depending on the user. As an example, the speech system 120 may perform speaker recognition on audio signals to determine an identity of the speaker. If the speaker is a child, for instance, the child profile may have permission restrictions where they are unable to request audio to be output via certain devices and/or to output certain audio on one or more of the devices, for example. Conversely, a parent profile may be able to direct output of audio without restrictions.
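The profile-based permission levels described above may be illustrated with a simple check. The Profile fields blocked_devices and restricted_audio_allowed, and the function may_output, are hypothetical names chosen for this sketch; the disclosure does not define how permissions are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical user profile with per-device and per-content restrictions."""
    name: str
    blocked_devices: frozenset = frozenset()
    restricted_audio_allowed: bool = True


def may_output(profile, device, restricted_audio=False):
    """Return True if the profile may output the requested audio on the device."""
    if device in profile.blocked_devices:
        return False
    if restricted_audio and not profile.restricted_audio_allowed:
        return False
    return True
```

Here, a child profile could be constructed with one or more blocked devices and restricted_audio_allowed set to False, while a parent profile constructed with the defaults has no restrictions.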

In some examples, to determine the device state, the event bus 1346 may publish different events which indicate device states to various entities or components that subscribe to the event bus 1346. For instance, if an event of “play music” occurs for a voice-enabled device, the event bus 1346 may publish the indication of this event, and thus the device state of outputting audio may be determined for the voice-enabled device. Thus, various components, such as the targeting component 1334, may be provided with indications of the various device states via the event bus 1346. The event bus 1346 may further store and/or update device states for the voice-enabled devices in the user registry 1336. The components of the speech processing system 120 may query the user registry 1336 to determine device states.

A particular user profile may include a variety of data that may be used by the system 120. For example, a user profile may include information about what voice-enabled devices are associated with the user and/or user profile. The user profile may further indicate an IP address for each of the devices associated with the user and/or user profile, user IDs for the devices, indications of the types of devices, and current device states for the devices.

While the foregoing invention is described with respect to the specific examples, it is to be understood that the scope of the invention is not limited to these specific examples. Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, the invention is not considered limited to the example chosen for purposes of disclosure, and covers all changes and modifications which do not constitute departures from the true spirit and scope of this invention.

Although the application describes embodiments having specific structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are merely illustrative of some embodiments that fall within the scope of the claims of the application.

1. (canceled)
2. A device comprising: one or more processors; and non-transitory computer-readable media storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving, at the device, first input data requesting that audio be output by a target device associated with the device, wherein the device is configured to wirelessly receive audio data corresponding to the audio from one or more audio streaming services; determining, from user account data associated with the device, multiple audio-output devices that have been configured to communicate with the device, wherein at least one of the multiple audio-output devices lacks a connection to the one or more audio streaming services; selecting the target device from the multiple audio-output devices based at least in part on the first input data; sending the audio data from the device to the target device; and sending a first command from the device to the target device, the first command configured to cause the target device to transition to a state where the audio data from the one or more audio streaming services is utilized by the target device instead of the device to output the audio.
3. The device of claim 2, wherein selecting the target device is based at least in part on the user account data indicating that the target device has been configured to transition device states in response to commands sent to the target device from the device.
4. The device of claim 2, wherein: the first input data represents speech input received at a speech interface device; and the first input data indicates that the speech interface device has received a request to cause the target device to output the audio.
5. The device of claim 2, the operations further comprising: associating the device with the target device based at least in part on the target device being physically connected to the device; and wherein selecting the target device comprises selecting the target device based at least in part on the device being physically connected to the target device.
6. The device of claim 2, wherein: the first input data is received from a personal device running an application associated with the device; the first input data indicates a selection of the device as the target device; and selecting the target device comprises selecting the target device based at least in part on the device being configured to cause the target device to output the audio.
7. The device of claim 2, the operations further comprising: receiving second input data requesting to cause at least one of the multiple audio-output devices to output the audio in time synchronization with output of the audio by the target device; and sending, to the at least one of the multiple audio-output devices and based at least in part on the second input data: the audio data; and a second command to output the audio in time synchronization with output of the audio by the target device.
8. The device of claim 2, wherein: the device excludes a speaker; and selecting the target device comprises selecting the target device based at least in part on the target device being configured to output audio sent to the device.
9. The device of claim 2, the operations further comprising: receiving an indication that second input data has been received at the target device requesting to alter output of the audio at the target device; and causing, at the device, a queue associated with the audio data to be altered based at least in part on receiving the indication.
10. The device of claim 2, the operations further comprising: receiving second input data requesting to transfer output of the audio from the target device to an additional device of the multiple audio-output devices; and based at least in part on receiving the second input data: sending a second command to the additional device, the second command configured to cause the additional device to output the audio; and sending a third command to the target device, the third command configured to cause the target device to cease output of the audio.
11. The device of claim 2, the operations further comprising associating a state of the device with the target device such that, when a state change occurs for the device, the state change is caused to occur for the target device.
12. A method comprising: receiving, at a first device, first input data requesting that content be output by a second device associated with the first device, wherein the first device is configured to wirelessly receive content data corresponding to the content from one or more external services; determining, from user account data associated with the first device, multiple devices that have been configured to communicate with the first device, wherein at least one of the multiple devices lacks a connection to the one or more external services; selecting the second device from the multiple devices based at least in part on the first input data; sending the content data from the first device to the second device; and sending a first command from the first device to the second device, the first command configured to cause the second device to transition to a state where the content data from the one or more external services is utilized by the second device instead of the first device to output the content.
13. The method of claim 12, wherein selecting the second device is based at least in part on the user account data indicating that the second device has been configured to transition device states in response to commands sent to the second device from the first device.
14. The method of claim 12, wherein: the first input data represents speech input received at a speech interface device; and the first input data indicates that the speech interface device has received a request to cause the second device to output the content.
15. The method of claim 12, further comprising: associating the first device with the second device based at least in part on the second device being physically connected to the first device; and wherein selecting the second device comprises selecting the second device based at least in part on the first device being physically connected to the second device.
16. The method of claim 12, wherein: the first input data is received from a third device running an application associated with the first device; the first input data indicates a selection of the first device as a target device; and selecting the second device comprises selecting the second device based at least in part on the first device being configured to cause the second device to output the content.
17. The method of claim 12, further comprising: receiving second input data requesting to cause at least one of the multiple devices to output the content in time synchronization with output of the content by the second device; and sending, to the at least one of the multiple devices and based at least in part on the second input data: the content data; and a second command to output the content in time synchronization with output of the content by the second device.
18. The method of claim 12, wherein: the first device excludes a speaker; and selecting the second device comprises selecting the second device based at least in part on the second device being configured to output content sent to the first device.
19. The method of claim 12, further comprising: receiving an indication that second input data has been received at the second device requesting to alter output of the content at the second device; and causing, at the first device, a queue associated with the content data to be altered based at least in part on receiving the indication.
20. The method of claim 12, further comprising: receiving second input data requesting to transfer output of the content from the second device to a third device of the multiple devices; and based at least in part on receiving the second input data: sending a second command to the third device, the second command configured to cause the third device to output the content; and sending a third command to the second device, the third command configured to cause the second device to cease output of the content.
21. The method of claim 12, further comprising associating a state of the first device with the second device such that, when a state change occurs for the first device, the state change is caused to occur for the second device.